Coursera UIUC Data Mining notebook

Posted on 2017-05-06 | In Coursera , Data Mining | | Visitors |

stay stuned…
Text Mining and Analytics
Course can be found here
My certificate can be found here
Lecture slides can be found here

Data Visualization

Course can be found here

Lecture slides can be found here

About this course: Learn the general concepts of data mining along with basic methodologies and applications. Then dive into one subfield in data mining: pattern discovery. Learn in-depth concepts, methods, and applications of pattern discovery in data mining. We will also introduce methods for pattern-based classification and some interesting applications of pattern discovery. This course provides you the opportunity to learn skills and content to practice and engage in scalable pattern discovery methods on massive transactional data, discuss pattern evaluation measures, and study methods for mining diverse kinds of patterns, sequential patterns, and sub-graph patterns.

Week 1 Course Orientation

About the Course

Welcome to Data Visualization

Overview

In this module, you will become familiar with the course, your instructor, your classmates, and our learning environment.

Time

This module should take approximately 2 hours of dedicated time to complete.

Activities

The activities for this module are listed below (with assignments in bold):

Activity	Estimated Time Required
Read and review the Syllabus and About the Discussion Forums pages.	30 minutes
Complete the Orientation Quiz	15 minutes
Update your Coursera Profile and check out the Social Media page	15 minutes
Post on the Meet and Greet Discussion Board	1 hour

Goals and Objectives

The goal of the orientation module is to familiarize you with the course structure and the online learning environment. The orientation also helps you obtain the technical skills required for the course.

After this module, you should be able to:

Recall important information about this course.
Know your classmates.
Recall how discussion forums operate in the course

Syllabus

Course Description

Data Visualization is a course that teaches how to create visualizations that effectively communicate the meaning behind data to an observer through visual perception. We will learn how a computer displays information using computer graphics, and how the human perceives that information visually. We will also study the forms of data, including quantitative and non-quantitative data, and how they are properly mapped to the elements of a visualization to be perceived well by the observer. We will briefly overview some design elements for effective visualization, though we will not focus on the visual design needed to make attractive and artistic visualizations. This course does not require computer programming, but computer programming can be used to complete the exercises. The course will conclude with the integration of visualization into database and data-mining systems to provide support for decision making, and the effective construction of a visualization dashboard.

Course Goals and Objectives

Upon successful completion of this course, you will be able to:

Describe how 2-D and 3-D computer graphics are used to visualize data.
Describe how an observer perceives and processes information from a visual display.
Utilize a wide vocabulary of visualization methods and how best to apply them to different kinds of data.
Decide which design styles and colors work best for different visualization situations.
Visualize data when it is not numerical.
Use techniques for visualizing databases and data mining to help visually sort through massive datasets.
Analyze tasks and build visualization dashboards to provide data to support making a decision.

Textbook and Readings

There is no required textbook for this class. However the following textbooks may be helpful.

Visualization Analysis and Design by Tamara Munzner

Information Visualization: Perception for Design (3rd Edition) by Colin Ware

Course Outline

The course consists of 4 weekly modules that focus on the following.

Week 1: The Computer and the Human

Key Concepts:

Introduction to visualization
Using computer graphics to display data
The model human processor and Fitts’s law
Human visual perception and cognition
Week 2: Visualization of Numerical Data

Key Concepts:

Different kinds of visualizations and how best to apply them to data
Basic charts such as bar charts and scatter plots
More advanced visualization techniques, such as streamgraphs and parallel coordinates
Some elements of design and color usage
Week 3: Visualization of Non-Numerical Data

Key Concepts:

Graphs, networks, and hierarchies
Layout of relational and hierarchical data, such as treemaps
Methods for visualizing high-dimensional data, such as principal component analysis and multidimensional scaling
Week 4: The Visualization Dashboard

Key Concepts:

Visualizing large datasets
Visualization of databases and data mining results
Visual analytics for decision support
Task analysis
Visualization dashboards

Elements of This Course

The course is comprised of the following elements:

Lecture videos. In each module the concepts you need to know will be presented through a collection of short video lectures. You may stream these videos for playback within the browser by clicking on their titles or download the videos. You may also download the slides that go along with the videos.
Quizzes. Week 1 and Week 4 will include a for-credit quiz. You will be allowed 3 attempts at the quiz per every 8 hours. Each attempt may present a different selection of questions to you. Your best score will be used when calculating your final score in the class. There is no time limit on how long you take to complete each attempt at the quiz.
Programming assignments. There are two required programming assignments for the class. The first programming assignment is to create a visualization of numerical data, and the second programming assignment is to create a visualization of non-numerical data (e.g., a network or a hierarchy). For each assignment, sample data will be provided, but you are encouraged to find your own data. Your goal will be to present that data in a visualization that helps the observer to better understand what the data represents. The programming assignments will be peer graded based on rubrics that measure how well the course’s methods have been applied to the visualization of the data.

Information About Lectures

The lectures in this course contain the most important information you need to know. The following resources accompany each video:

The play button will open the video up in your browser window and stream the lecture to you. The duration of the video (in hours-minutes-seconds format) is also listed. Within the player that appears, you can activate subtitles. English subtitles are available for all videos.
All video lectures have a discussion forum dedicated to them. This is a great place to discuss any questions you have about the content of the video or to share your ideas and responses to the video.

Discussion Forums

The discussion forums are a key element of this course. Be sure to read more about the discussion forums and how you can make the most of them in this class.

How to Pass the Course

To qualify for a Course Certificate, simply start verifying your coursework at the beginning of the course, get an 80% or higher on all quizzes and assignments combined, and pay the fee. Coursera Financial Aid is available to offset the registration cost for learners with demonstrated economic needs. If you have questions about Course Certificates, please see the help topics here.

Also note that this course is in the Data Mining Specialization offered by the University of Illinois at Urbana-Champaign. By earning a Course Certificate in this course, you are on your way toward earning a Specialization Certificate in Data Mining. You may also choose to pre-pay for the entire Specialization, at a discount. See more information about Specialization payments here.

If you choose not to pay the fee, you can still audit the course. You will still be able to view all videos, submit practice quizzes, and view required assessments. Auditing does not include the option to submit required assessments. As such, you will not be able to earn a grade or a Course Certificate.

Getting and Giving Help

You can get/give help via the following means:

Use the Learner Help Center to find information regarding specific technical problems. For example, technical problems would include error messages, difficulty submitting assignments, or problems with video playback. If you cannot find an answer in the documentation, you can also report your problem to the Coursera staff by clicking on the Contact Us! link available on each topic’s page within the Learner Help Center.
Use the Content Issues forum to report errors in lecture video content, assignment questions and answers, assignment grading, text and links on course pages, or the content of other course materials. University of Illinois staff and Community Mentors will monitor this forum and respond to issues.
Note: Due to the large number of learners enrolled in this course, I am not able to answer emails sent directly to my account. Rather, all questions should be reported as described above.

About the Discussion Forums

Expectations

With the large number of learners in this course, no one is expected to read every post made within the discussion forums. Rather, read those that seem interesting to you and reply when you can further the conversation. Above all, you are expected to remain civil and treat all other learners with respect. Failure to do so will result in your removal from the course.

Helpful Tools

Searching for Posts

You can search for a specific key words by using the search box at the top of any individual forum or the All Course Discussions page. It may be helpful to see if your topic or question has already been covered before posting. If so, you can reply to that thread instead of starting a new one. You can also click on someone’s name in the forums (including your own) to see all of the posts that have been made by that person.

Upvoting Posts

When you view any post, you will see an Upvote button under each post. Click the button to draw attention to thoughtful, interesting, or helpful posts. In the course feedback forums, upvoting will ensure that important questions get addressed. The upvoting system is the best way elevate the best posts to be seen by other learners and University of Illinois staff.

Reporting Inappropriate Posts

Please report any posts that are abusive, offensive, that infringe upon copyright, or that otherwise violate Coursera’s Honor Code by using the Report this option found under the menu arrow to the right of each post.

Following

If you find a particular thread to be interesting to you, click the Follow button under the original post of that thread page to receive email notifications when new posts are made.

Improving Your Posts

The forums are your chance to interact with thousands of like-minded individuals on these topics. Getting their attention is one way to do well in this course. In any social interaction, certain rules of etiquette are expected and contribute to more enjoyable and productive communication. The following are tips for interacting in this course via the forums, adapted from guidelines originally compiled by AHA! and Chuq Von Rospach & Gene Spafford:

Search the other posts to see if your topic is already covered. If it is, reply to that thread instead of starting a new one.
Post in the most appropriate forum for your topic, and do not post the same thing in multiple forums.
Use a meaningful title for your thread.
Be civil. If you disagree, explain your position with respect and refrain from any and all personal attacks.
Stay on topic. In particular, don’t change the subject in the middle of an existing thread – just start a new topic.
Make sure you’re understood, even by non-native English speakers. Try to write full sentences, and avoid text-message abbreviations or slang. Be careful when you use humor and sarcasm as these messages are easy to misinterpret.
If asking a question, provide as much information as possible, what you’ve already considered, what you’ve already read, etc.
Cite appropriate references when using someone else’s ideas, thoughts, or words.
Do not use a forum to promote your product, service, or business.
Do not post personal information about other posters in the forum.
Ignore spammers and report them.
For more details, refer to Coursera’s Code of Conduct.

Updating Your Profile

Optionally, please consider updating your profile, which can also be accessed by clicking the Profile link in the menu that appears when you click on the arrow next to your initials at the top-right corner of this screen. When people find you in the forums, they can click on your name to view your complete profile and get to know you more.

Learning takes place not only through direct instruction in excellent courses but also through quality interaction with your peers. Research suggests that creating and/or joining a learning community around your interests and passions helps you stay motivated to learn. You are encouraged to leverage social media platforms to connect with thousands of your peers from across the world. Learn from others, network, share what you learn, create study groups, discuss interesting course topics as well as off-topic ideas, and share your own perspectives about this subject. You can even set up or get notified about physical meetups on these forums so that you don’t miss out any opportunity to interact face-to-face with those who share your interests. Remember that the more active these communities are, the more value they will bring to all. We are also looking for participants in the course to take a leading role in keeping these communities active and of value to all of us, so do join your preferred platform and share!

Connect to your classmates via one or more of the following means:

Google+
Facebook
Twitter
Use #ILLINOISdatavisualization

More Information

If you find another social media page or community related to this course, feel free to share it in the discussion forums.

You may also be interested in liking the following:

NOTE: Do not post links to copyrighted materials in the Coursera discussion forums or on social networks. Doing so is a violation of the Coursera Terms of Service.

Resources

Data Visualization Resources

Below are some links and other resources for tools for scientific visualization. The projects that you will be working on in Week 2 and Week 3 do not require computer programming, and you can use any of the tools below, including the free data visualization websites. Thanks to UIUC Prof. Karrie Karahalios for many of the links. None of these links are in any particular order, so feel free to jump around.

Online Data Visualization Websites

Software

Tableau
SAP Lumira (including a free Personal Edition version)
Microsoft Excel (or any other spreadsheet that includes charts)
ClearStory
BeyondCore
Mathematica
MATLAB
MatPlotLib (if you are comfortable programming Python)
R Programming Language
ggplot2

Some Data Sources

Orientation Activities

[1 point]
1.I am required to read a textbook for this course.

False

True

[1 point]
2.Which of the following activities are required to pass the course? Check all that apply.

Forum assignments

Programming assignments

Quizzes

Reading assignments

[1 point]
3.The following tools will help me use the discussion forums:

“Searching” for your question or a topic before you post a new thread.

“Following” any forums that are particularly interesting to me.

“Upvoting” posts that are thoughtful, interesting, or helpful.

[1 point]
4.If I have a problem in the course I should:

Call the instructor

Drop the class

Report it to the Learner Help Center (if the problem is technical) or to the Content Issues forum (if the problem is an error in the course materials).

Email the instructor

[1 point]
5.This course includes __ weekly modules.

Week 1 Information

Week 1 Overview

Week 1: The Computer and the Human

Overview

During this week’s module, you will explore what data visualization is, as well as various types of data visualization. You’ll also begin to understand how computers display 2-D and 3-D shapes, as well as draw some simple 2-D shapes of your own. In addition to this, you will learn about how humans perceive, learn, and reason about information.

Time

This module should take approximately 5-6 hours of dedicated time to complete, with its videos and assignments.

Activities

The activities for this module are listed below (with assignments in bold):

Activity	Estimated Time Required
Week 1 Video Lectures	2 hours
Read the How the Programming Assignments Work Page	1-2 hours
Week 1 Quiz	30 minutes
Week 1 Discussion Initial Post	30 minutes
Week 1 Discussion Response Posts	30 minutes

Goals and Objectives

Upon successful completion of this module, you will be able to:

Understand what data visualization is and the different kinds of data visualization.
Understand how computer graphics are used to display shapes in 2-D and 3-D, and draw simple 2-D shapes on a web page using Scalable Vector Graphics
Understand how we perceive, learn, and reason about information.

Guiding Questions

Develop your answers to the following guiding questions while completing the readings and working on assignments throughout the week.

What is data visualization and how is it used?
How does the computer display information?
How does the user perceive information?

Key Phrases and Concepts

Keep your eyes open for the following key terms or phrases as you complete the readings and interact with the lectures. These topics will help you better understand the content in this module.

Interactive visualization, presentation visualization, and interactive storytelling
Scalable Vector Graphics and the difference between how graphics’ shapes are described versus how they are displayed.
Photorealistic rendering and non-photorealistic rendering
The Model Human Processor and Fitts’s law
Lateral inhibition

Tips for Success

To do well this week, I recommend that you do the following:

Review the video lectures a number of times to gain a solid understanding of the key questions and concepts introduced this week.
When possible, provide tips and suggestions to your peers in this class. As a learning community, we can help each other learn and grow. One way of doing this is by helping to address the questions that your peers pose. By engaging with each other, we’ll all learn better.
It’s always a good idea to refer to the video lectures and chapter readings we’ve read during this week and reference them in your responses. When appropriate, critique the information presented.
Take notes while you read the materials and watch the lectures for this week. By taking notes, you are interacting with the material and will find that it is easier to remember and to understand. With your notes, you’ll also find that it’s easier to complete your assignments. So, go ahead, do yourself a favor; take some notes!

Getting and Giving Help

You can get/give help via the following means:

Use the Learner Help Center to find information regarding specific technical problems. For example, technical problems would include error messages, difficulty submitting assignments, or problems with video playback. If you cannot find an answer in the documentation, you can also report your problem to the Coursera staff by clicking on the Contact Us! link available on each topic’s page within the Learner Help Center.
Use the Content Issues forum to report errors in lecture video content, assignment questions and answers, assignment grading, text and links on course pages, or the content of other course materials. University of Illinois staff and community TAs will monitor this forum and respond to issues

Week 1 Introduction

video

0:08

So what is data visualization? Well, data is stored on a computer. But we want to work with the data, and so we need to get the data somehow out of the computer and into our heads. So we can think about the data. So we can understand the data. So we can gain insight into what the data represents.

0:27

And so the best way we’ve found to do that is using visual communication. And so data visualization is a high bandwidth connection between the computer and the human brain that transmits as efficiently as possible the data from the computer and inserts it into our brains.

0:47

And so here’s a sample system, a typical visualization system, that I’ve got diagrammed here. We have a scanner here that’s going to acquire data. In this case it’s going to use a laser beam in order to scan. The shape of this graduate student’s head, and then that data gets processed by the computer into geometry and color, and then it gets displayed on this display screen.
1:14
At this point we set up this visualization, this data visualization channel that’s going to take this data from the display screen. And inserted it into my brain through this perceptual channel. So, we get this visualization operation where I’m looking at the data, and it’s going through my perceptual system, going into my human memory and then I can start to think about it. In this case, there’s some strange things happening with the laser scanning here at the top of the students head. And I want to try to understand why that’s happening.

1:47

And so data visualization isn’t just about seeing data. It’s about understanding data and being able to make decisions based on the data. And so another example of how data visualization has been useful. Is when I discovered just as a university professor.

2:05

So, students apply for graduate school at Computer Science at the University of Illinois, and we have a webpage that they use to select faculty they might want to work with. And so we say, pick out the three faculty members you’d like to work with, and then we have drop down menus with all the faculty. And students pick the faculty they want to work with, and then the students

2:29

having made that choice, that information is entered into a database. And for each student applicant, we can see the faculty they might be interested in working with and we can start to get an idea for each faculty member what students they would want to pick from.

2:46

So, I wanted to plot to see how uniform this was across the faculty, and so I used a data visualization tool in order to do this and I came up with a standard bar chart. And, horizontally, these are basically the different faculty members. There’s some faculty members are in our department, and some are affiliated with our department, and some get more students interested in working with them, and some get fewer students working with them, and so horizontally I’ve got the a faculty members here and vertically I’m plotting the number of student that want to work with that faculty member. And so we’ve got some faculty members that are very popular and some that aren’t very popular and again, some of the one’s that aren’t very popular are faculty that are affiliated with other departments. And students usually apply to work with them through other departments.

3:37

And I’ve eliminated the faculty names here, just to avoid any embarrassment. So we’ve got this list and then I colour code it, the bar charts. This is a stacked bar chart. I’ve colour coded the bar charts based on blue, if the faculty member was a students first choice. Orange if the faculty member was the students second choice, and green if the faculty member was the students third choice.

4:04

And you might see something here, I certainly saw it when I first plotted this data, and that was that students were picking people in the first half of the alphabet. This is in alphabetical order. Students are picking faculty from the first part of the alphabet first, and they’re picking people from the second half of the alphabet mostly last. And then the middle part of the alphabet, or I should say, they’re picking second people uniformly across the alphabet and what I discovered was that from this visualization, was that students were picking faculty in alphabetical order.

4:46

And having made that discovery from that visualization, I was able to work back into the system and figure out why students were picking faculty in alphabetical order. And the reason was we had three pull down lists with all the faculty. And so the students would go down the list and find the first faculty member they were interested in and click there. They’d go down the list and pick the second faculty they were interested in and they would select that, and then they would find the third faculty member they were interested in and click that, and then that would go into the database, but when we were looking at the database, in order to evaluate what students might want to work with us. We were assuming students were picking the faculty they wanted to work with first, as the first choice, and their second choice was second, and their third choice was third. When in fact, they were just finding three faculty members and picking them when they appeared in the list. And so in this case data visualization helped us make a decision that we wouldn’t have been aware of otherwise. And so that’s a hallmark of data visualization is to provide

5:53

insight into the data, and be able to use that insight in order

5:58
to make further decisions, further hypotheses to explore further.
6:04
And so that’s the kind of insight that I want to provide and the tools that I want to provide, to students in this course. And so we’ll work through basically four weeks of materials on data visualization. The first week, we’re just going to understand the platform for data visualization. Basically, how a computer displays data, and then how the user perceives and processes it.

6:31
And so we’ll spend most of the first week just studying the computer and studying the human. The second week, we’ll introduce a few simple charts, bar charts and so on. And more importantly, we’ll look at the kinds of data that we want to display. That we want to visualize, and we’ll make sure. That we’re using the proper chart for the proper kinds of data. And that the proper kinds of data are being mapped to the correct elements of that chart. And then we’ll also talk about how to design charts effectively to convey that data. In week three, we’ll look at how to layout and display other data, non numerical data, relational data. When you’re data isn’t necessarily going to be a number that will give you a height and a bar chart, but is basically a relation that says item A is related to item B.

7:26
How do we display that? How do we visualize that? How do we gain insight into that kind of data? So we spend week three on more looking at abstract data as opposed to numerical data. And then week four, we’ll talk about visualizing data bases, how to visualize the results of data mining. And how to analyze the way we work, so the visualization can provide us the insight into how do we make a decision. And so we’ll talk about how to make a visualization dashboard that gives you the information you need, in order to make an effective decision, and an informed decision.

8:06
So I think it’s useful to talk about what we won’t be able to cover in these four weeks. We’re going to talk about abstract data visualization. I’m not going to be able to talk in much detail about specific tools used for visualization, or visualization tools for scientific visualization or medical visualization, business data, business intelligence systems and so on. There’s a number of tutorials. And manuals for using specific packages and we just won’t have time to go into any one of them, but i’ll try to provide examples across those different applications.

8:46
I’ll be able to provide, provide a few design principles, but I won’t be able to cover a full course on visual design for presentation and so there are good courses on that.

8:59
That you might take a look at, but, we’ll just be looking at some very minimal design principles that are most effective for data visualization.
9:10
And this will be an introduction to the topics of data visualization. And data visualization can be a very deep subject. And I won’t be able to get into the details of visualization. Will be at sort of a high level of the different areas of visualization. But focusing on the important topics of visualization to make any visualization successful, and to work well. I’ll try to provide pointers to tools, but again, not tutorials on how to use those tools, and there’ll be a couple of assignments in

9:50
the course, and programming can be used to complete those assignments and I encourage you to program if you know how to do these assignments, but it’s not required and I’ll provide existing tools that you can use that you don’t need to write a program in order to use to visualize your data in order to complete these assignments and in general to be able to visualize data. And finally, I’d like to thank a few people that have been very helpful with putting this course together. Eric Shaffer is a lecturer here at the University of Illinois. And he’s talked in one of the most recent scientific visualization classes we had and he’s provided some materials for me for this class from that. I’ve very grateful for his help on that. Karrie Karahalios is also a professor of Computer Science here at the University of Illinois and she’s taught social visualization classes, especially visualizing, kind of relational data in social situations and she’s a world renowned expert on that area and has provided me with a lot of resources and website links for tools to help with a variety of visualization methods. Mike Gleicher is a professor, a friend of mine at the University of Wisconsin. We just completed a university wide data visualization course and had some nice material there that he provided me that I’ve also been able to use in this course. Donna Cox is a good friend of mine from the NCSA. Who taught me a lot of what I know about data visualization. And many other presentation visualization examples I’ll be showing are her work from the National Center for Supercomputing Applications here at the University of Illinois. And finally, Pat Hanrahan, professor at Stanford, has been very helpful and provided a lot of good advice and resources for data visualization, and I owe him my gratitude as well. [MUSIC]

How the Programming Assignments Work

The Data Visualization course includes two programming assignments, one during Week 2 and another during Week 3. These programming assignments give you the opportunity to apply your understanding of data visualization to a visualization project. For Programming Assignment 1 (during Week 2), you will create a visualization of quantitative data (data involving numbers), whereas for Programming Assignment 2 (during Week 3), you will create a visualization of non-quantitative data, such as visualizing a network. Using a programming language is not required to complete either project, but it is allowed.

Your programming assignments will be peer graded. During Weeks 2 and 3, I will detail what you are to submit and where you are to submit it. The programming assignments that you submit will then be graded by four or more of your peers. At the same time, you will grade programming assignments submitted by four or more of your peers. Your final score for each graded programming assignment will be the median of all the scores your work receives.

You are required to grade fairly. While we do not manually review every score that is given to a peer, we do scan for trends within the data that suggest further investigation when someone is not giving a thoughtful score. Unfair grading is a violation of the Coursera Terms of Service and may constitute dismissal from the class and disqualification from earning a Course Certificate. Peer grading is an important part of the ecosystem that makes a course with this many participants successful. Keep in mind, however, that you are not only helping others when you grade a peer’s assignment, but you also benefit from this process. Carefully reviewing a peer’s assignment gives you insights into what others are doing and may spark ideas for your own assignments. Furthermore, when you receive feedback from your peers, you will likely gain ideas of how to improve your own assignments.

Instructions for Week 1

The first week is best spent preparing an environment for you to complete these two projects, and so we have a project milestone to set up your data visualization environment. This project milestone is not graded, but it will be very helpful for you to make progress on your project. Delaying work on your project will result in a lot more work for you in the next modules.

If you plan to use computer programming to complete your programming assignments, then you should set up a programming environment that enables you to plot data and display graphics in an image, video, or other medium that can be made available through a webpage URL to others for grading. If you do not plan to use computer programming, then you should select a software/web platform that you will use to construct your data visualization. A good starting point for these software and online web data visualizations is available on the Data Visualization Resources page. If you are aware of additional tools for data visualization, please share them on the Programming Assignment 1 or Programming Assignment 2 forums.

You should also consider what data you would like to visualize. Is there anything in particular about which you might be curious and would like to study? Or is there a subject with which you are already familiar but about which you would like to educate your colleagues? This is an opportunity to peruse the many data repositories freely available on the internet, and to use data visualization to learn and educate. The Data Visualization Resources page includes some links to get you started in your search for interesting data. If you are aware of additional sources of data, or come across an interesting datastore, please share these on the Programming Assignment 1 or Programming Assignment 2 forums.

I want to encourage you to be as creative as possible with the programming assignments. I’ve been amazed at the work my students have submitted in my past visualization classes, and I’m really looking forward to seeing what you come up with.

Lesson 1-1: Introduction

1.1.1. Some Books on Data Visualization

video

1.1.2. Overview of Visualization

video

Lesson 1-2: Graphics, Drawing and Photorealism

Lesson 1-3: Humans and Visualizations

1.3.1. The Human

video

1.3.2. Memory

video

1.3.3. Reasoning

video

1.3.4. The Human Retina

video

1.3.5. Perceiving Two Dimensions

video

1.3.6. Perceiving Perspective

video

Week 1 Activities

Quiz with 10/10 points (100%)

[1 point]

1.Which kind of visualization would you use to try to answer a personal question about your data?

Presentation visualization

Interactive visualization

Interactive storytelling

[1 point]

2.Which kind of visualization would you use to share a discovery about your data with your colleagues in a slide show?

Interactive visualization

Presentation visualization

Interactive storytelling

[1 point]

3.Which kind of visualization would you use to create a web page that allows viewers to see a visualization of data that you prepared, but also allows the viewer to further investigate the data?

Interactive storytelling

Presentation visualization

Interactive visualization

[1 point]

4.In what order does a data visualization graphics pipeline process information?

Pixel processing, then vertex processing, then rasterization

Vertex processing, then pixel processing, then rasterization

Pixel processing, then rasterization, then vertex processing

Rasterization, then pixel processing, then vertex processing

Rasterization, then vertex processing, then pixel processing

Vertex processing, then rasterization, then pixel processing

[1 point]

5.How many items can human working memory (short-term memory) typically hold?

3–7 items

30–70 items

300–700 items

[1 point]

6.A light gray box drawn on top of a dark gray background will make the light gray box appear __.

Brighter

Darker

The same as it appears on a white background

[1 point]

7.When visualizing data, you should keep your eyes focused on one point for the entire duration of the visualization.

False, because your visual system will play tricks on your perception of the data.

True, because your visual system will better detect any changes to datapoints during the visualization.

[1 point]

8.On which of these colors does the human eye have the most difficulty focusing?

Yellow

Blue

Red

Green

[1 point]

9.Which one of the 3-D depth cues below indicates surface orientation?

Stereopsis

Shadowing

Illumination

Occlusion

[1 point]

10.Given a plot of life expectancy based on country and birth year, you look up your country and birth year, find the displayed life expectancy, and conclude you will probably live that long. This is an example of _.

Inductive reasoning

Deductive reasoning

Subductive reasoning

Abductive reasoning

Discussion Prompt

Overview

Be sure you have read the About the Discussion Forums page before beginning this assignment.

Time Estimate

You should spend approximately 1 hour to prepare, perhaps search the web, formulate your thoughts, submit your initial post, and read and respond to your classmates’ posts.

Option 1 – Good Visualization

Background

Review the definition of data visualization and the differences between interactive visualization, presentation visualization, and online storytelling

Question

What web page URL would you point people to as a good example of presentation visualization or interactive storytelling? Which is it and why is it a good example of this type? What about the web page impresses you the most? How does the web page help you understand the data and gain insight into what the data represents?

Option 2 – Optical Illusion

Background

Review your notes on visual perception and the various factors that can lead to the misinterpretation of an image. (In particular, review your lecture notes from or revisit the lectures “The Human Retina,” “Perceiving Two Dimensions,” and “Perceiving Perspective.”)

Question

What is a favorite optical illusion of yours? Can you provide a pointer to an image or video demonstrating it? Or better yet, can you reproduce your own version of it? Why is this an optical illusion? What elements of the human visual perceptual system are causing the effect of the illusion?

Requirements

Compose an initial post that meets the following requirements:

Includes 300-500 words.
Succinctly summarizes your thoughts.
Uses accepted standard American or British English (e.g., capitalization, punctuation, and spelling).

Submission

Note: This is an optional activity.
All participants submit an initial post in the forum for this activity. Simply type your response in the box below.
All participants should also review the posts of their classmates. Click the title of any post to review it and any replies already submitted.
All participants should respond to at least 2 posts of their classmates. Click the Reply to Thread link next to any post to compose a reply.
Participation is optional

Week 2: Visualization of Numerical Data

Week 2 Information

Week 2 Overview 10 min

Overview

During this week’s module, you will learn how to appropriately select chart time and assign data to chart elements, all while learning how to visualize data in the most effective way possible. You’ll also learn how to plot data variables using higher dimensional visualization methods, and apply principles of design and color to make your visualizations compelling, engaging, and effective.

Time

This module should take approximately 4-5 hours of dedicated time to complete, with its videos and assignments.

Activities

The activities for this module are listed below (with assignments in bold):
— | —
Activity | Estimated Time Required
Week 2 Video Lectures | 2 hours
Programming Assignment 1 submission | 2-3 hours

Goals and Objectives

Upon successful completion of this module, you will be able to:

Select an appropriate chart time and assign data to appropriate chart elements to visualize data effectively.
Understand basic charts and how their elements imply certain characteristics of the data they display.
Plot more data variables using higher dimensional visualization methods including glyphs, parallel coordinates, and streamgraphs.
Apply principles of design and color to make a data visualization more compelling, engaging, and effective.

Guiding Questions

Develop your answers to the following guiding questions while completing the readings and working on assignments throughout the week.

How can I apply these techniques to the data I encounter in my own work?

Additional Readings and Resources

Design:

The Visual Display of Quantitative Information by Edward Tufte

Envisioning Information by Edward Tufte

Visual Explanations: Images and Quantities, Evidence and Narrative by Edward Tufte

Color:

Information Visualization: Perception for Design by Colin Ware

Key Phrases and Concepts

Keep your eyes open for the following key terms or phrases as you complete the readings and interact with the lectures. These topics will help you better understand the content in this module.

Data variables: nominal, ordinal, and quantitative; discrete v. continuous; dependent v. independent
The perceptual accuracy of how different chart elements represent data variables
How glyphs represent multiple dimensions of individual data items, how parallel coordinates plot data over many dimensions, and how streamgraphs improve on stacked bar charts
Chartjunk, the data-ink ratio, and other design rules
Hue, saturation, value, and other ways of thinking about color

Tips for Success

To do well this week, I recommend that you do the following:

Review the video lectures a number of times to gain a solid understanding of the key questions and concepts introduced this week.
When possible, provide tips and suggestions to your peers in this class. As a learning community, we can help each other learn and grow. One way of doing this is by helping to address the questions that your peers pose. By engaging with each other, we’ll all learn better.
It’s always a good idea to refer to the video lectures and chapter readings we’ve read during this week and reference them in your responses. When appropriate, critique the information presented.
Take notes while you read the materials and watch the lectures for this week. By taking notes, you are interacting with the material and will find that it is easier to remember and to understand. With your notes, you’ll also find that it’s easier to complete your assignments. So, go ahead, do yourself a favor; take some notes!

Getting and Giving Help

You can get/give help via the following means:

Use the Learner Help Center to find information regarding specific technical problems. For example, technical problems would include error messages, difficulty submitting assignments, or problems with video playback. If you cannot find an answer in the documentation, you can also report your problem to the Coursera staff by clicking on the Contact Us! link available on each topic’s page within the Learner Help Center.
Use the Content Issues forum to report errors in lecture video content, assignment questions and answers, assignment grading, text and links on course pages, or the content of other course materials. University of Illinois staff and community TAs will monitor this forum and respond to issues

Week 2 Introduction 2 min

video

Lesson 2-1: Data, Mapping, and Charts

2.1.1. Data 7 min

video
transcript

2.1.2. Mapping 9 min

video

2.1.3. Charts 9 min

video

Lesson 2-2: Glyphs, Parallel Coordinates, and Stacked Graphs

2.2.1. Glyphs (Part 1) 4 min

video

2.2.1. Glyphs (Part 2) 6 min

video

2.2.2. Parallel Coordinates 8 min

video

2.2.3. Stacked Graphs (Part 1) 5 min

video

2.2.3. Stacked Graphs (Part 2) 6 min

video

http://www.visualisingdata.com/2010/08/making-sense-of-streamgraphs/

Lesson 2-3: Tufte’s Design Rules and Using Color

2.3.1. Tufte’s Design Rules 12 min

video

2.3.2. Using Color 11 min

video
http://colorbrewer2.org/

Week 2 Activities

Programming Assignment 1: Visualize Data Using a Chart 10 min

https://www.coursera.org/learn/datavisualization/supplement/m6KuV/programming-assignment-1-visualize-data-using-a-chart

Overview

This assignment will give you a chance to explore the topics covered in Week 2 of the course by visualizing some data as a chart. The data set we provided deals with world temperatures and comes from NASA. Alternatively you can use any data that you would like to explore. You are not required to use D3.js, but if you would like to, we have provided some helpful resources that will make it easier for you to create a visualization. You are welcome to use the additional resources, especially if you do not want to program to complete this project.

Goals

The goal of this assignment is to give you some experience with handling and deciding how to visualize some data and for you to get a feel for the various aspects of the visualization.

This assignment will also help you to analyze other visualizations and assess their effectiveness.

Time Estimation

This is not a tricky assignment, but the amount of time that it takes will vary based on the tools you use and the amount of customization you would like to do in your chart.

Instructions

Take the data from the GISTEMP site, specifically the data from “Table Data: Global and Hemispheric Monthly Means and Zonal Annual Means.” Alternatively you can use any data that you would like to explore instead.
Parse the data to a suitable format for the tools that you are using – we have provided two files (in JS, TXT, and CSV formats) that contain the data used to make the visualizations here, which is a subset of the data on the GISTEMP site.
Visualize the data in any meaningful way you wish, keeping in mind the requirements of the Programming Assignment 1 Rubric.
Click below to download the .zip file for this programming assignment.
Programming Assignment 1 Data New.zip
If you’re interested, you can also download the original data by clicking below.

Programming Assignment Data - GISTEMP Original.zip
If you need them, there are free-to-use websites that you can use to write and execute D3 code:

Runnable (see also D3.js for examples)
VIDA

Log of Changes to the Data

We realize that we have updated the data provided from our side multiple times, so this a quick log of the changes:

We updated the .zip by adding the CSV and TXT files, in response to a request.
There was a formatting issue (there were commas in incorrect places) in the files that caused confusion. The formatting was changed to become clearer.
It was pointed out that the data from the GISTEMP site did not match the data we had provided on our side. This was because there had been an update to the GISTEMP data, due to a bug which they had found, since the time we created the .zip on our side. We overlooked this update, which is why the data had differed. A small formatting change was also made for clarity. In the second DATA2 files, there are no longer spaces in the longitudinal demarcations. From “24N -90N,” it is now “24N-90N,” and likewise.

Submission

You must upload an image (or a URL to a website displaying an image) of your visualization for peer evaluation.
In addition to your visualizations, please include a paragraph that helps explain your submission. Here are a few questions that your paragraph could answer:

What are your x- and y-axes?
Did you use a subset of the data? If so, what was it?
Are there any particular aspects of your visualization to which you would like to bring attention?
What do you think the data and your visualization shows?

Submit Programming Assignment 1

Evaluation

Your peers will use the Programming Assignment 1 Rubric to evaluate your submissions. You will also evaluate four of your peers’ assignments after you have submitted your assignment. This assignment is worth 15 points.

Evaluate Programming Assignment 1

Q & A

Please post any issues or questions about this assignment to the Programming Assignment 1 Help Forum.

Sample Submissions

Sample Submission 1:

Graph Submission:

Explanation:

This is a sample submission of a visualization of the data from the GISTEMP site. Every line corresponds to a decade, with every point on the line being a year. The y-axis is the deviation from the 1951–1980 mean in 0.01 degrees. The x-axis then goes from the start to end of the decade.

When hovering over any of the dots, there is a tooltip that displays that year along with the particular deviation for that year. The resulting graph shows a clear increasing mean temperature over the decades.

Grading Rubric – Sample 1

Appropriate Chart Selection and Variables	5 points – The data is well represented by the assignment of data to variables.
Design of the Chart	4 points – It is hard to follow crossing lines.

Note: The Contest part of the rubric is up to you.

Sample Comments from Instructor:

The easiest fix to the first chart submission would be to use color (hue) to differentiate the decades (here shown as a nominal or ordinal variable indicating which line corresponds to which decade). Since the lines cross each other so often, it may be a little hard to follow one line closely. It would be a good idea to have each line colored differently to better distinguish between them.

Sample Submission 2:

Graph Submission:

Explanation:

This graph visualizes the GISTEMP data for the Globe and the North and South Hemispheres through all the given years. The blue line describes the data for the Northern Hemisphere, the red for the South Hemisphere, and the black line is for the Globe.

The resulting graph shows an increasing mean global temperature over the years.

Grading Rubric – Sample 2:

Appropriate Chart Selection and Variables	5 points – The data is well represented by the assignment of data to variables.
Design of the Chart	4 points – The lack of a legend makes the explanation necessary and reduces the direct effect of the chart.

Note: The Contest part of the rubric is up to you.

Sample Comments from Instructor:

The graph would probably be better with a legend on the graph itself so it would be easier to tell what each line describes.

Additional Resources for D3

These Sample Submissions used D3.js for the graphs. You are not required to use D3.js, but if you would like to, the following are some helpful resources that will make it easier for you to create a visualization:

Tutorials and examples:

D3 Tutorial Book

D3 – Making a Line Graph

Simple D3 Line Graph

D3 – Line Graph

Programming Assignment 1 Rubric 10 min

Criteria

Appropriate Chart Selection and Variables

Did you select the appropriate chart and use the correct chart elements to visualize the nominal, ordinal, discrete, and continuous variables, as described in lecture 2.1.3? Continuous data variables should be assigned to continuous chart elements (e.g., lines between data points), whereas discrete variables should be assigned to discrete chart elements (e.g., separate bars). Furthermore, the assignment of variables to elements should follow the priorities in lecture 2.1.2.

Design of the Chart

Does the chart effectively display the data, based on the design rules in lecture 2.3.1?

Contest

How interesting is the result? Does this represent an interesting choice of data and/or an interesting way to display the data? For example, was a streamgraph used instead of an ordinary bar chart?

Grading

Criteria	Poor (1–2 points)	Fair (3 points)	Good (4 points)	Great (5 points)
Appropriate Chart Selection and Variables	Chart is indecipherable or significantly misleading because of poor chart type or assignment of variables to elements	Major problem(s) with chart selection or assignment of elements to variables	Minor problem(s) with chart selection or assignment of elements to variables	Chart selection is appropriate for data and its elements properly assigned to appropriate data variables
Design of the Chart	No apparent attention paid to design	Evidence that several of the design rules should have been followed but were not	Evidence that one of the design rules should have been followed but was not	Attention paid to all design rules
Contest	Misleading	Boring	Not boring	Interesting

Peer-graded Assignment: Programming Assignment 12h

Due May 21, 11:59 PM PDT

Instructions

Before submitting your visualization image, make sure you review the full instructions page.

Submit your assignment on the My submission tab. Enter your answers directly in the spaces provided in the My submission tab. You may save a draft of your work as you go, and you can come back later to continue working on your draft. When you are finished working, click the Preview button, verify your identity, and then click Submit for review to submit the assignment.
Review criterialess
Before submitting your visualization image, make sure you review the Programming Assignment 1 Rubric page.

You are required to evaluate the submissions of at least FOUR of your peers based on the instructions and rubric provided. You may begin giving feedback to other students as soon as you submit your assignment; click the Review classmates tab to begin. Feel free to provide additional reviews beyond the four required!

My submission

my submission

Review Your Peers: Programming Assignment 1

Due May 24, 11:59 PM PDT
Ready to review your classmates?
Reviewing another person’s work is a valuable way to learn and providing quality feedback is a useful skill to master. Here are some tips to help you provide the most helpful, accurate, and consistent feedback possible:

Remember, English isn’t everyone’s native language. Focus primarily on how well the work meets the grading criteria.
When writing critical feedback, use a constructive and polite tone.
Everyone enrolled in the course must review at least 4 other submissions to ensure everyone receives a grade; however, many learners complete more to help their classmates who are still waiting. Ready to get started?

Discussion Prompt: Programming Assignment 1 Help Forum5 min

Week 3: Visualization of Non-Numerical Data

John C. Hart

In this week’s module, you will learn how to visualize graphs that depict relationships between data items. You’ll also plot data using coordinates that are not specifically provided by the data set.

Week 3 Information

Week 3 Overview 10 min

Overview

During this week’s module, you will learn how to visualize graphs that depict relationships between data items. You’ll also plot data using coordinates that are not specifically provided by the data set.

Time

This module should take approximately 6-7 hours of dedicated time to complete, with its videos and assignments.

Activities

The activities for this module are listed below (with assignments in bold):

Activity	Estimated Time Required
Week 3 Video Lectures	2 hours
Programming Assignment 1 Peer Evaluation	2 hours
Programming Assignment 2 submission	2-3 hours

Goals and Objectives

Upon successful completion of this module, you will be able to:

Visualize graphs that depict relationships between data items.
Layout data using coordinates that are not explicitly provided by the data.

Guiding Questions

Develop your answers to the following guiding questions while completing the readings and working on assignments throughout the week.

Is there any data you encounter that is difficult to plot because it lacks appropriate coordinates?
Is it because the data has too many dimensions?
Or is it because the interesting details are in the data points themselves?
Or is it the relationship between data points?

Additional Readings and Resources

Design:

Graphviz – This page provides tools for graph visualization you can use to complete Programming Assignment 2 without any actual programming if necessary.
Graph Drawing: Algorithms for the Visualization of Graphs – A good but older book on algorithms for graph visualization.

Key Phrases and Concepts

Keep your eyes open for the following key terms or phrases as you complete the readings and interact with the lectures. These topics will help you better understand the content in this module.

Graphs, especially different kinds of graphs (e.g., directed v. undirected), ways to represent graphs (e.g., adjacency matrix), and the associated terms (e.g., node degree)

NOTE: Don’t worry if you do not have experience with linear algebra and some of the other concepts used in the slides. Terms like linear system, eigenvector, and optimization describe the tools we use in visualization. While I am trying to provide the details of those tools to the students who may be interested in them, understanding those details is not going to be required to complete the course, and we will provide implementations of those tools that you can simply use as a black box if needed. In any event, it will be good for everyone to be exposed to the math and computation needed to produce good coordinates to plot data when it does not already contain good coordinates.

Social networks and centrality metrics
Treemaps, word clouds, cartograms, and choropleths

Tips for Success

To do well this week, I recommend that you do the following:

Review the video lectures a number of times to gain a solid understanding of the key questions and concepts introduced this week.
When possible, provide tips and suggestions to your peers in this class. As a learning community, we can help each other learn and grow. One way of doing this is by helping to address the questions that your peers pose. By engaging with each other, we’ll all learn better.
It’s always a good idea to refer to the video lectures and chapter readings we’ve read during this week and reference them in your responses. When appropriate, critique the information presented.
Take notes while you read the materials and watch the lectures for this week. - By taking notes, you are interacting with the material and will find that it is easier to remember and to understand. With your notes, you’ll also find that it’s easier to complete your assignments. So, go ahead, do yourself a favor; take some notes!

Getting and Giving Help

You can get/give help via the following means:

Use the Learner Help Center to find information regarding specific technical problems. For example, technical problems would include error messages, difficulty submitting assignments, or problems with video playback. If you cannot find an answer in the documentation, you can also report your problem to the Coursera staff by clicking on the Contact Us! link available on each topic’s page within the Learner Help Center.
Use the Content Issues forum to report errors in lecture video content, assignment questions and answers, assignment grading, text and links on course pages, or the content of other course materials. University of Illinois staff and Community Mentors will monitor this forum and respond to issues.

Week 3 Introduction 1 min

video

Lesson 3-1: Graphs, Graph Visualization, and Tree Maps

3.1.1. Graphs and Networks 8 min

video

3.1.2. Embedding Planar Graphs 11 min

https://www.coursera.org/learn/datavisualization/lecture/OZYbt/3-1-2-embedding-planar-graphs

3.1.3. Graph Visualization 13 min

https://www.coursera.org/learn/datavisualization/lecture/7N1GE/3-1-3-graph-visualization

3.1.4. Tree Maps 9 min

https://www.coursera.org/learn/datavisualization/lecture/9Pj4n/3-1-4-tree-maps

Lesson 3-2: Principal Component Analysis and Multidimensional Scaling

3.2.1. Principal Component Analysis 8 min

https://www.coursera.org/learn/datavisualization/lecture/lvgIt/3-2-1-principal-component-analysis

3.2.2. Multidimensional Scaling 6 min

https://www.coursera.org/learn/datavisualization/lecture/6ZQop/3-2-2-multidimensional-scaling

Lesson 3-3: Packing

3.3.1. Packing 12 min

https://www.coursera.org/learn/datavisualization/lecture/wZ6DH/3-3-1-packing
http://www.mappinghacks.com/data/

Week 3 Activities

Programming Assignment 2: Visualize Network Data 10 min

https://www.coursera.org/learn/datavisualization/supplement/TTbrU/programming-assignment-2-visualize-network-data

Overview

This assignment is meant to give you the opportunity to do non-coordinate data visualization, specifically using a network of your choosing.

Goals

Network data lends itself easily to graph visualization. This assignment will allow you to explore the means of network visualization that you have encountered in the course, to decide how best to visualize data, and to analyze other network visualizations.

Time Estimation

The time this assignment takes will vary greatly and depends a lot on your choice of data set, on the tools you use, and on the customization of your data. Try to leave yourself at least three hours to devote to this assignment.

Instructions

Find some network data that you think is suitable and that you would like to visualize. Here are some sites that provide links to a wide variety of different graph/network datasets:

Choose a visualization platform and parse the data into a format suitable for the tools you will use.

For non-programmers, there are downloadable programs for creating graph visualizations at Graphviz. The program “neato,” which creates a layout for an undirected graph based on multidimensional scaling, is a good place to start. 1. The main challenge with using these tools is converting the graph data into the input text file format used by the tool, and understanding (and experimenting with) the various tool settings.
For programmers, there are graph visualization tools available in D3 for JavaScript, such as force-directed graphs, treemaps, collision detection, and a nice graph drawing tutorial. Feel free to use any other libraries or languages as well.
Visualize the data in a meaningful way, keeping in mind the requirements of the rubric.

Submission

You must upload an image of your visualization for peer evaluation.
In addition to your visualization, please include a paragraph that helps explain your submission. A few questions that your paragraph could answer include:

What is the data set that you chose? Why?
Did you use a subset of the data? If so, what was it?
Are there any particular aspects of your visualization to which you would like to bring attention?
What do you think the data and your visualization show?

Submit Programming Assignment 2

Evaluation

Your peers will use the Programming Assignment 2 Rubric to evaluate your submissions. You will also evaluate four of your peers’ assignments after you have submitted your assignment. This assignment is worth 15 points.

Evaluate Programming Assignment 2

Q & A

Please post any issues or questions about this assignment to the Programming Assignment 2 Help Forum.

Sample Submissions

Sample Submission 1:

Explanation:

This is a graph visualization representing the most commonly occurring adjectives and nouns in the novel David Copperfield by Charles Dickens. The nodes are both adjectives and nouns, and the edges connect words that appear in an adjacent position in a book. Hovering over any node brings up the data associated with that node.

The size of the node is determined by the number of edges it has, and the positions are determined by a force-directed layout algorithm.

Grading Rubric

Proximate Layout	5 points – Related data is connected properly, and the varying sizes helps to properly separate data.
Design of the Visualization	4 points – Only one problem with the use of color.

Note: The Contest part of the rubric is up to you.

Sample Comments from Instructor:

Even though there are many edge overlaps, this is a large non-planar graph. The items are well distributed and the edges are easy to follow visually through the overlaps.

The frequency of words is nicely mapped to the area. The colors are assigned randomly. While the random colors help differentiate the nodes, there is no differentiation between adjectives and nouns, and so the visualization misses the opportunity to easily incorporate this dimension in its use of color at the nodes.

Sample Submission 2:

One way to create a graph from ordinary data is to create an edge between items and the categories they belong to. For example, this is a map of the CS faculty at UIUC. We had a list of faculty and the areas of CS they worked in. For example:

…

Han: database, mining

Hart: graphics

…

Zhai: database, mining, bio, ML

Using a dot file (for the “dot” layout program available from Graphviz) that assigned each faculty member and each research area to a node, we then made edges from each faculty member to the research areas they had listed. The result was the attached map of the CS department that clustered faculty by the areas they worked in.

Grading Rubric

Proximate Layout	4 points – Unnecessary and distracting overlaps of the edges and nodes
Design of the Visualization	4 points – No differentiation between faculty and areas.

Note: The Contest part of the rubric is up to you.

Sample Comments from Instructor:

Even though the instructor made this graph, it is far from perfect. While the edges do not overlap each other, the edges do overlap the nodes, which makes it difficult to see. For a graph with this few number of elements, one could adjust the placement by hand, perhaps using the free vector graphics editing utility “inkscape.” Also, the graph does not differentiate between faculty and areas, but could with node shapes and/or colors.

Programming Assignment 2 Rubric 10 min

Proximate Layout

How well are related items placed near each other? Do the edges cross or do items overlap when perhaps they do not need to? Are the crossings distracting?

Design of the Visualization

Does the visualization effectively utilize the assignment of variables to elements and design of a visualization described in Week 2?

Contest

How interesting is the result? Does this represent an interesting choice of data and/or an interesting way to display the data?

Grading

Criteria	Poor (1–2 points)	Fair (3 points)	Good (4 points)	Great (5 points)
Proximate Layout	Relationship between items cannot be discerned because of poor layout.	Major problems with the layout, leading to many long edges and/or overlaps that distract from the data.	Minor problems with the layout, resulting in long edges or unnecessary overlaps in objects or edges.	Related items are placed near each other and intersections of visualization elements are not unnecessarily distracting.
Design of the Visualization	Relationship between items cannot be discerned because of poor element and/or design choices.	Major problems with some elements and/or design choices that interfere with the display of the data.	Minor problems with some elements and/or design choices that distract from the display of the data.	Visualization effectively uses elements and design to display the data.
Contest	Misleading	Boring	Not boring	Interesting

Peer-graded Assignment: Programming Assignment 2 2h

Due May 28, 11:59 PM PDT

Install graphviz with python

https://goo.gl/8Fz9PY
open the command window with Win + R and type pip install graphviz
it appears

Collecting graphviz
C:\Python27\lib\site-packages\pip\_vendor\requests\packages\urllib3\util\ssl_.py
:318: SNIMissingWarning: An HTTPS request has been made, but the SNI (Subject Na
me Indication) extension to TLS is not available on this platform. This may caus
e the server to present an incorrect TLS certificate, which can cause validation
 failures. You can upgrade to a newer version of Python to solve this. For more
information, see https://urllib3.readthedocs.io/en/latest/security.html#snimissi
ngwarning.
  SNIMissingWarning
C:\Python27\lib\site-packages\pip\_vendor\requests\packages\urllib3\util\ssl_.py
:122: InsecurePlatformWarning: A true SSLContext object is not available. This p
revents urllib3 from configuring SSL appropriately and may cause certain SSL con
nections to fail. You can upgrade to a newer version of Python to solve this. Fo
r more information, see https://urllib3.readthedocs.io/en/latest/security.html#i
nsecureplatformwarning.
  InsecurePlatformWarning
  Downloading graphviz-0.7-py2.py3-none-any.whl
Installing collected packages: graphviz
Successfully installed graphviz-0.7
C:\Python27\lib\site-packages\pip\_vendor\requests\packages\urllib3\util\ssl_.py
:122: InsecurePlatformWarning: A true SSLContext object is not available. This p
revents urllib3 from configuring SSL appropriately and may cause certain SSL con
nections to fail. You can upgrade to a newer version of Python to solve this. Fo
r more information, see https://urllib3.readthedocs.io/en/latest/security.html#i
nsecureplatformwarning.
  InsecurePlatformWarning

using gvedit.exe

d3 display with local server

download gml2json.py from this website
download graph data from UCI data
open cmd in your file directory and type python gml2json.py dolphins.gml >> dolphinsdata.json to save information to dolphinsdata.json file
download index.html and .json file from Force-Directed Graph and modify to your style
open cmd in your file directory and type python -m SimpleHTTPServer and you can see your svg in http://localhost:8000/index.html

D3 display with website

log in this website runnable.com
play it in this website
my account http://code.runnable.com/u/SSQ

Review Your Peers: Programming Assignment 2

Due May 31, 11:59 PM PDT

Discussion Prompt: Programming Assignment 2 Help Forum 5 min

https://www.coursera.org/learn/datavisualization/discussions/weeks/3/threads/Mq4-EDvHEeew7Q7A7m3f-g

Week 4: The Visualization Dashboard

In this week’s module, you will start to put together everything you’ve learned by designing your own visualization system for large datasets and dashboards. You’ll create and interpret the visualization you created from your data set, and you’ll also apply techniques from user-interface design to create an effective visualization system.

Week 4 Information

Week 4 Overview 10 min

Overview

During this week’s module, you will begin to design a visualization system for a large data set. After creating an initial overview of the data set, you’ll search for interesting elements in that data set using techniques such as zooming and filtering. Finally, you’ll apply techniques you learned from user-interface design to create an effective visualization system.

Time

This module should take approximately 5 hours of dedicated time to complete, with its videos and assignments.

Activities

The lessons and assignments for this module are listed below (with assignments in bold):

Activity	Estimated Time Required
Week 4 Video Lectures	2 hours
Programming Assignment 2 Peer Evaluation	2 hours
Week 4 Quiz	1 hour

Goals and Objectives

Upon successful completion of this module, you will be able to:

Design a visualization system for large datasets and dashboards to support informed decision making.
Create an initial overview of a large dataset, and then find interesting elements through zooming, filtering, and requesting details when needed.
Use visualization as a method for forming effective queries that reveal structure in a database.
Apply techniques from user-interface design to create effective visualization systems.

Guiding Questions

Develop your answers to the following guiding questions while completing the readings and working on assignments throughout the week.

How would I use these techniques to design a visualization system or a dashboard for my own datasets?
For the data cube lessons, consider how the operations would be applied to a very simple database. For example, an image can be stored as a database of pixels, with records of the form (x,y,b) where x,y are the pixel coordinates and b is the pixel’s brightness. How would the various data cube visualization methods work on that kind of data?

Additional Readings and Resources

Visualization Analysis and Design by Tamara Munzner (includes some tasks)
Interaction Design: Beyond Human-Computer Interaction by Jenny Preece et al. (My lecture content on user interface design are based on an older version of this textbook.)

Key Phrases and Concepts

Keep your eyes open for the following key terms or phrases as you complete the readings and interact with the lectures. These topics will help you better understand the content in this module.

Information visualization and visual analytics
Focus+Context
OLAP, data cubes, and other database concepts
Requirements specification and other system design stages
Synthesizability and other user interface concepts

Tips for Success

To do well this week, I recommend that you do the following:

Review the video lectures a number of times to gain a solid understanding of the key questions and concepts introduced this week.
When possible, provide tips and suggestions to your peers in this class. As a learning community, we can help each other learn and grow. One way of doing this is by helping to address the questions that your peers pose. By engaging with each other, we’ll all learn better.
It’s always a good idea to refer to the video lectures and chapter readings we’ve read during this week and reference them in your responses. When appropriate, critique the information presented.
Take notes while you read the materials and watch the lectures for this week. By taking notes, you are interacting with the material and will find that it is easier to remember and to understand. With your notes, you’ll also find that it’s easier to complete your assignments. So, go ahead, do yourself a favor; take some notes!

Getting and Giving Help

You can get/give help via the following means:

Use the Learner Help Center to find information regarding specific technical problems. For example, technical problems would include error messages, difficulty submitting assignments, or problems with video playback. If you cannot find an answer in the documentation, you can also report your problem to the Coursera staff by clicking on the Contact Us! link available on each topic’s page within the Learner Help Center.
Use the Content Issues forum to report errors in lecture video content, assignment questions and answers, assignment grading, text and links on course pages, or the content of other course materials. University of Illinois staff and Community Mentors will monitor this forum and respond to issues.

Week 4 Introduction 55 sec

https://www.coursera.org/learn/datavisualization/lecture/ubdoC/week-4-introduction

Lesson 4-1: Visualization Systems and Database Visualization

4.1.1. Visualization Systems 3 min

https://www.coursera.org/learn/datavisualization/lecture/7fsLR/4-1-1-visualization-systems

4.1.2. The Information Visualization Mantra: Part 1 9 min

https://www.coursera.org/learn/datavisualization/lecture/BdRy4/4-1-2-the-information-visualization-mantra-part-1

4.1.2. The Information Visualization Mantra: Part 2 9 min

https://www.coursera.org/learn/datavisualization/lecture/Vdwv3/4-1-2-the-information-visualization-mantra-part-2

4.1.2. The Information Visualization Mantra: Part 3 5 min

https://www.coursera.org/learn/datavisualization/lecture/eTQR1/4-1-2-the-information-visualization-mantra-part-3

4.1.3. Database Visualization Part: 1 12 min

https://www.coursera.org/learn/datavisualization/lecture/vtMpA/4-1-3-database-visualization-part-1

4.1.3. Database Visualization Part: 2 8 min

https://www.coursera.org/learn/datavisualization/lecture/5gXTf/4-1-3-database-visualization-part-2

4.1.3. Database Visualization Part: 3 9 min

https://www.coursera.org/learn/datavisualization/lecture/k9fP2/4-1-3-database-visualization-part-3

Lesson 4-2: Visualization System Design

4.2.1. Visualization System Design 14 min

https://www.coursera.org/learn/datavisualization/lecture/Svnib/4-2-1-visualization-system-design

Week 4 Activities

Quiz: Week 4 Quiz8 questions

Due June 4, 11:59 PM PDT

QUIZ
Week 4 Quiz
8 questions
To Pass80% or higher
Attempts3 every 8 hours
Deadline
June 4, 11:59 PM PDT

1 point
1.When creating an overview visualization of a large dataset, it is most important to:

Display only an important subset of the datapoints so as to not overwhelm the user

Display all of the data using a simple representation and axes that spread the data out as much as possible

Pack as many details as possible into the display to be as efficient and informative as possible

1 point
2.The process of zooming on the data plotted in a visualization, where the zoomed region then fills the entire display:

Is an important part of Schneiderman’s information visualization mantra

Provides focus on the zoomed portion of the data

Is actually a filtering operation on the display coordinates of the plotted data

All of the above

None of the above

1 point
3.The goal of the filtering step of the information visualization mantra is to:

Remove outliers

Display a subset

Smooth noisy data

1 point
4.Which of the following benefits of a fisheye lens is LEAST important for data visualization?

It allows zooming without obscuring the unzoomed data.

It supports focus on detail along with the context of that detail amid the rest of the dataset.

It makes the data appear more interesting.

1 point
5.Suppose we have a dataset representing an image consisting of pixel records of the form (x,y,b) where x and y are the spatial coordinates of the pixel and b is the brightness of the pixel. Then, which of the following provides the best histogram of the data?

A subdivision of the image’s (x,y) coordinates into regions, plotting an average pixel brightness for each region

A plot of the average pixel brightness over the y axis, of all pixels that share the same x coordinate

A subdivision of the image’s brightness values into ranges, plotting the count of the pixels whose brightness is in each range

1 point
6.Suppose we have a dataset representing an image consisting of pixel records of the form (x,y,b) where x and y are the spatial coordinates of the pixel and b is the brightness of the pixel. Then, which of the following represents a “rollup” of the x and y dimensions of this dataset?

The average brightness of the image

The position of the center of the image

None of the above

1 point
7.Suppose we have a dataset representing an image consisting of pixel records of the form (x,y,b) where x and y are the spatial coordinates of the pixel and b is the brightness of the pixel. Which axis definition would we NOT use if we wanted to plot the pixel brightness at a unique axis location for every pixel in the image?

An axis formed by nesting the x dimension under the y dimension

An axis formed by the product of the x dimension and the y dimension

An axis formed by the concatenation of the x dimension and the y dimension

1 point
8.When designing a dashboard visualization, what should the primary concern be?

That the dashboard visualization presents all of the data necessary to make an informed decision

That the dashboard visualization engages the user to motivate further study of the data

That the dashboard provide a simple overview of all the data, without any distracting details

View my Certification

Text Retrieval and Search Engines

Course can be found here
Lecture slides can be found here
About this course: Recent years have seen a dramatic growth of natural language text data, including web pages, news articles, scientific literature, emails, enterprise documents, and social media such as blog articles, forum posts, product reviews, and tweets. Text data are unique in that they are usually generated directly by humans rather than a computer system or sensors, and are thus especially valuable for discovering knowledge about people’s opinions and preferences, in addition to many other kinds of knowledge that we encode in text.

This course will cover search engine technologies, which play an important role in any data mining applications involving text data for two reasons. First, while the raw data may be large for any particular problem, it is often a relatively small subset of the data that are relevant, and a search engine is an essential tool for quickly discovering a small subset of relevant text data in a large text collection. Second, search engines are needed to help analysts interpret any patterns discovered in the data by allowing them to examine the relevant original text data to make sense of any discovered pattern. You will learn the basic concepts, principles, and the major techniques in text retrieval, which is the underlying science of search engines.

resources

ChengXiang Zhai

Welcome to Text Retrieval and Search Engines! You’re joining thousands of learners currently enrolled in the course. I’m excited to have you in the class and look forward to your contributions to the learning community.

To begin, I recommend taking a few minutes to explore the course site. Review the material we’ll cover each week, and preview the assignments you’ll need to complete to pass the course. Click Discussions to see forums where you can discuss the course material with fellow students taking the class.

If you have questions about course content, please post them in the forums to get help from others in the course community. For technical problems with the Coursera platform, visit the Learner Help Center.

Good luck as you get started, and I hope you enjoy the course!

Orientation

You will become familiar with the course, your classmates, and our learning environment. The orientation will also help you obtain the technical skills required for the course.

About the Course

Welcome to Text Retrieval and Search Engines! 10 min

Overview

In this module, you will become familiar with the course, your instructor, your classmates, and our learning environment.

Time

This orientation should take approximately 3 hours to complete.

Goals and Objectives

After this module, you should be able to:

Recall important information about this course.
Get to know your classmates.
Be familiar with how discussion forums operate in the course.

Instructional Activities

Below is a list of the activities and assignments you must complete in this module. Click on the name of each activity for more detailed instructions.

Activity	Estimated Time Required
Watch the Welcome Video and Course Introduction Video	15 minutes
Read and review the Syllabus and About the Discussion Forums pages	30 minutes
Complete the Orientation Quiz	15 minutes
Complete the course Pre-Quiz	30 minutes
Update your Coursera Profile and check out the Social Media page	15 minutes
Post on the Meet and Greet Discussion Board	1 hour

Syllabus 10 min

Course Description

Recent years have seen a dramatic growth of natural language text data, including web pages, news articles, scientific literature, emails, enterprise documents, and social media such as blog articles, forum posts, product reviews, and tweets. Text data are unique in that they are usually generated directly by humans rather than a computer system or sensors, and are thus especially valuable for discovering knowledge about people’s opinions and preferences, in addition to many other kinds of knowledge that we encode in text.

Course Goals and Objectives

By the end the course, you will be able to do the following:

Explain many basic concepts and multiple major algorithms in text retrieval and search engines.
Explain how search engines and recommender systems work and how to quantitatively evaluate a search engine.
Create a test collection, run text retrieval experiments, and experiment with ideas for improving a search engine (if you complete the programming assignment).

Course Outline

The course consists of 6 weekly modules. Please note: There are no required readings for this course. All readings listed below are optional and are primarily from the textbook

C. Zhai and S. Massung, Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining, ACM Book Series, Morgan & Claypool Publishers, 2016.

Week 1
Key Concepts:

Part of Speech tagging, syntactic analysis, semantic analysis, and ambiguity
“Bag of words” representation
Push, pull, querying, browsing
Probability ranking principle
Relevance
Vector space model
Dot product
Bit vector representation

Recommended Readings:

N. J. Belkin and W. B. Croft. 1992. Information filtering and information retrieval: Two sides of the same coin? Commun. ACM 35, 12 (Dec. 1992), 29-38.
C. Zhai and S. Massung, Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining, ACM Book Series, Morgan & Claypool Publishers, 2016. Chapters 1 - 6.

Week 2
Key Concepts:

Term frequency (TF)
Document frequency (DF) and inverse document frequency (IDF)
TF transformation
Pivoted length normalization
BM25
Inverted index and postings
Binary coding, unary coding, gamma-coding, and d-gap
Zipf’s law

Recommended Readings:

C. Zhai and S. Massung, Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining, ACM Book Series, Morgan & Claypool Publishers, 2016. Chapter 6 - Section 6.3, and Chapter 8.
Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images, Second Edition. Morgan Kaufmann, 1999.

Week 3
Key Concepts:

Cranfield evaluation methodology
Precision and recall
Average precision, mean average precision (MAP), and geometric mean average precision (gMAP)
Reciprocal rank and mean reciprocal rank
F-measure
Normalized Discounted Cumulative Gain (nDCG)
Statistical significance test

Recommended Readings:

Mark Sanderson. Test collection based evaluation of information retrieval systems. Foundations and Trends in Information Retrieval 4, 4 (2010), 247-375.
C. Zhai and S. Massung, Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining, ACM Book Series, Morgan & Claypool Publishers, 2016. Chapter 9

Week 4
Key Concepts:

p(R=1|q,d), query likelihood, and p(q|d)
Statistical and unigram language models
Maximum likelihood estimate
Background, collection, and document language models
Smoothing of unigram language models
Relation between query likelihood and TF-IDF weighting
Linear interpolation (i.e., Jelinek-Mercer) smoothing
Dirichlet Prior smoothing

Recommended Readings:

C. Zhai and S. Massung, Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining, ACM Book Series, Morgan & Claypool Publishers, 2016. Chapter 6 - Section 6.4

Week 5
Key Concepts:

Relevance feedback
Pseudo-relevance feedback
Implicit feedback
Rocchio feedback
Kullback-Leiber divergence (KL-divergence) retrieval function
Mixture language model
Scalability and efficiency
Spams
Crawler, focused crawling, and incremental crawling
Google File System (GFS)
MapReduce
Link analysis and anchor text
PageRank and HITS

Recommended Readings:

C. Zhai and S. Massung, Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining, ACM Book Series, Morgan & Claypool Publishers, 2016. Chapters 7 & 10

Week 6
Key Concepts:

Learning to rank, features, and logistic regression
Content-based filtering
Collaborative filtering
Beta-Gamma threshold learning
Linear utility
User profile
Exploration-exploitation tradeoff
Memory-based collaborative filtering
Cold start

Recommended Readings:

C. Zhai and S. Massung, Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining, ACM Book Series, Morgan & Claypool Publishers, 2016. Chapters 10 - Section 10.4, Chapter 11.

Elements of This Course

The course is comprised of the following elements:

Lecture videos. Each week your instructor will teach you the concepts you need to know through a collection of short video lectures. You may either stream these videos for playback within the browser by clicking on their titles, or you can download each video for later offline playback by clicking the download icon.

The videos in each week is usually total 1.5 to 2 hours, but you generally need to spend at least the same amount of time digesting the content in the video. The actual amount of time needed to digest the content would naturally vary according to your background.

Quizzes. Each week will include one for-credit quiz. Your cumulative score will be used when calculating your final score in the class. There is no time limit on how long you take to complete each quiz, and you will be allowed 3 attempts at the quiz every 8 hours. The deadline for all quizzes is the last day of the course.

The quiz in each week should take less than 1 hour to finish, assuming you have mastered the materials to be tested in the quiz, and you should make sure that you have mastered the materials before you attempt to work on the quizzes. You should use the practices to help you understand and master the materials.

Programming assignments. The programming assignments for this course are optional, but they provide an opportunity for you to practice your programming skills and apply what you’ve learned in the course.

The programming assignment is optional and you can budget about 2 hours each week to work on it if you plan to finish it; you may need to budget more time for this if you are not familiar with C++ programming.

Information about Lectures

The lectures in this course contain the most important information you need to know. You can access these lectures in each week’s lesson section. The following resources accompany each video:

The play button will open the video up in your browser window and stream the lecture to you. The duration of the video (in hours-minutes-seconds format) is also listed. Some lectures may include in-video questions described above. Within the player that appears, you can click the CC button to activate closed captions. English captions are available for all videos.
All video lectures have a discussion forum dedicated to them. This is a great place to discuss any questions you have about the content of the video or to share your ideas and responses to the video.

Discussion Forums

The discussion forums are a key element of this course. Be sure to read more about the discussion forums and how you can make the most of them in this class.

How to Pass the Course

I am continually looking to improve this course and may encounter some issues requiring us to make changes sooner rather than later. As such, this syllabus is subject to change. I appreciate your input and ask that you have patience as we make adjustments to this course. I also recognize that this is no ordinary course. You may have different perspectives and different goals for this course than some of your peers, or that I could have anticipated. Therefore, I want to empower you to customize this course to meet your needs.

To qualify for a Course Certificate, simply start verifying your coursework at the beginning of the course, get an 70% or higher on all quizzes and assignments combined, and pay the fee. Coursera Financial Aid is available to offset the registration cost for learners with demonstrated economic needs. If you have questions about Course Certificates, please see the help topics here.

Getting and Giving Help

You can get/give help via the following means:

Use the Learner Help Center to find information regarding specific technical problems. For example, technical problems would include error messages, difficulty submitting assignments, or problems with video playback. If you cannot find an answer in the documentation, you can also report your problem to the Coursera staff by clicking on the Contact Us! link available on each topic’s page within the Learner Help Center.
Use the Content Issues forum to report errors in lecture video content, assignment questions and answers, assignment grading, text and links on course pages, or the content of other course materials. University of Illinois staff and Community Mentors will monitor this forum and respond to issues.

About the Discussion Forums 10 min

Expectations

Helpful Tools

Searching for Posts

Upvoting Posts

Reporting Inappropriate Posts

Following

If you find a particular thread to be interesting to you, click the Follow button under the original post of that thread page to receive email notifications when new posts are made.

Improving Your Posts

Search the other posts to see if your topic is already covered. If it is, reply to that thread instead of starting a new one.
Post in the most appropriate forum for your topic, and do not post the same thing in multiple forums.
Use a meaningful title for your thread.
Be civil. If you disagree, explain your position with respect and refrain from any and all personal attacks.
Stay on topic. In particular, don’t change the subject in the middle of an existing thread – just start a new topic.
Make sure you’re understood, even by non-native English speakers. Try to write full sentences, and avoid text-message abbreviations or slang. Be careful when you use humor and sarcasm as these messages are easy to misinterpret.
If asking a question, provide as much information as possible, what you’ve already considered, what you’ve already read, etc.
Cite appropriate references when using someone else’s ideas, thoughts, or words.
Do not use a forum to promote your product, service, or business.
Do not post personal information about other posters in the forum.
Ignore spammers and report them.
For more details, refer to Coursera’s Code of Conduct.

Updating your Profile 10 min

For some people, social media helps them get even more connected with the people and topics of interest to them. This page is to help you get connected to this course via social media. You are not required to participate in social media in any way. This information is presented merely for those who want to engage with the course in this way. We have provided it to give you an opportunity to further connect with your classmates, learn what others are saying about the class, create study groups, discuss interesting course topics as well as off-topic ideas, and share your own perspectives about this subject.

Connect to your classmates via one or more of the following means:

Where Are We From?

Google Map Instructions

Want to find out where everyone else taking this course lives? Check out this Google Map. If you haven’t done so already, consider adding a pushpin indicating where you live. No need to be too specific, but put a pin on the town or even country in which you live. To do so:

First, you must be logged into your Google account (e.g., a service such as Gmail).
Then, go to the map.
Next, search for your general location via the search box at the top.
Next, click on the pushpin that drops in your general location. If there is no box when you click on the pushpin, then likely you were not logged into Google before clicking the map link.
Click Add to map. The pushpin will turn red when it’s added to the map. No need to specify your name or anything. Just click the OK button.
That’s it!

More Information

If you find another social media page or community related to this course, feel free to share it in the discussion forums.

You may also be interested in liking or following:

NOTE: Do not post links to copyrighted materials in the Coursera discussion forums or on social networks. Doing so is a violation of the Coursera Terms of Service and Facebook Statement of Rights and Responsibilities.

Course Errata 10 min

For updates on changes made to the course, please see the Course Errata website.

Orientation Activities

Course Welcome Video 3 min

https://www.coursera.org/learn/text-retrieval/lecture/04Py5/course-welcome-video

Course Introduction Video 11 min

https://www.coursera.org/learn/text-retrieval/lecture/4nB2z/course-introduction-video

Quiz: Orientation Quiz 5 questions

1 point
1.This course lasts for ___ weeks.

1 point
2.I am required to read a textbook for this course.

False

True

1 point
3.Which of the following activities are required each week to get credit for the course? Check all that apply.

Forum Assignments

Quizzes

Programming Assignments

Reading Assignments

1 point
4.The following tools will help me use the discussion forums:

“Searching” for your question or a topic before you post a new thread.

“Following” any forums that are particularly interesting to me.

“Voting-up” posts that are thoughtful, interesting, or helpful

1 point
5.If I have a problem in the course I should:

Drop the class

Call the instructor

Report it to the Learner Help Center (if the problem is technical) or to the Content Issues forum (if the problem is an error in the course materials).

Email the instructor

Practice Quiz: Pre-Quiz 13 questions

PRACTICE QUIZ
Pre-Quiz
13 questions
To Pass70% or higher
Deadline
May 28, 11:59 PM PDT

1 point
1.Probability & Statistics: Rolling a 6-faced die, what’s the probability of seeing a “6”?

1/6

1/3

1/2

1 point
2.Probability & Statistics: Rolling a 6-faced die, what’s the probability of seeing an even number?

1/3

1/2

1/6

1

1 point
3.Probability & Statistics: Rolling a 6-faced die, given that the number is even, what’s the probability that we’ve got a “6”?

1/2

1/6

1/3

1 point
4.Probability & Statistics: Rolling two independent 6-faced dice, what’s the probability that both dice show the same number?

1/2

1/3

1/6

1

1 point
5.Linear algebra: What’s the value of 2 * x, where x =[1.0,2.0,3.0]?

[2.0, 2.0, 3.0]

[1.0, 2.0, 3.0]

[2.0, 4.0, 6.0]

1 point
6.Linear algebra: If x=[1,2,3] and y=[1,−2,2], what’s the dot product x and y?

2

1 point
7.Linear algebra: What is the result of multiplying matrix M=[1221] by a vector xT=[1,1],

[3,3]

[4,4]

[1,1]

[2,2]

1 point
8.Basic algebra: if x and y are both positive numbers, and x > y, then what can we say about log(x) and log(y)?

log(x) < log(y)

This can’t be determined

log(x) > log(y)

1 point
9.Basic algebra: Which of the following is true?

log(x-y) = log(x) - log(y)

log(xy) = log(x) + log(y)

log (x+y) = log(x) + log(y)

1 point
10.Basic algebra: Which of the following is true (where exp(x) is the exponential function with e as the base)?

exp(x+y) = ln(x)*ln(y)

ln(x+y) = exp(x) + exp(y)

exp(ln (x)) = x

1 point
11.Basic Computer Science: Which of the following operations occur in a computer program faster?

Reading a 32-bit integer from the memory

Reading a 32-bit integer from the hard disk.

1 point
12.Basic Computer Science: Which of the following statements is true about data structures?

A linked list is the best for supporting direct access to any element in the list

A linked list to store k integers would require more storage space than an array to store the same k integers.
Correct
Correct! Linked lists also require storing header information.

Hash table is faster for sequential access to the elements than a linked list

1 point
13.Basic Computer Science: What’s the value of the binary code 1011?

Week 1

During this week’s lessons, you will learn of natural language processing techniques, which are the foundation for all kinds of text-processing applications, the concept of a retrieval model, and the basic idea of the vector space model.

Week 1 Information

Week 1 Overview 10 min

During this week’s lessons, you will learn the overall course design, an overview of natural language processing techniques, which are the foundation for all kinds of text-processing applications, the concept of a retrieval model, and the basic idea of the vector space model.

Time

This module should take approximately 3 hours of dedicated time to complete, with its videos and assignments.

Activities

The activities for this module are listed below (with required assignments in bold):

Activity	Estimated Time Required
Week 1 Video Lectures	2 hours
Week 1 Graded Quiz	1 hour

Goals and Objectives

After you actively engage in the learning experiences in this module, you should be able to:

Explain some basic concepts in natural language processing, text information access.
Explain why text retrieval is often defined as a ranking problem.
Explain the basic idea of the vector space retrieval model and how to instantiate it with the simplest bit-vector representation.

Guiding Questions

Develop your answers to the following guiding questions while watching the video lectures throughout the week.

What does a computer have to do in order to understand a natural language sentence?

Lexical analysis (part-of-speech tagging); Syntactic analysis (Parsing); Pragmatic analysis (speech act)

What is ambiguity?
•Word-level ambiguity: E.g.,
–“design” can be a noun or a verb (Ambiguous POS)
–“root” has multiple meanings (Ambiguous sense)
•Syntactic ambiguity: E.g.,
–“natural language processing” (Modification)
–“A man saw a boy with a telescope.” (PP Attachment)
•Anaphora resolution: “John persuaded Bill to buy a TV for himself.” (himself = John or Bill?)
•Presupposition: “He has quit smoking.” implies that he smoked before.
Why is natural language processing (NLP) difficult for computers?

–Ambiguity is a “killer”!
–Common sense reasoning is pre-required

What is bag-of-words representation? Why do modern search engines use this simple representation of text?

In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.
the search task is not all that difficult.

What are the two modes of text information access? Which mode does a web search engine such as Google support?

Pull vs. Push
pull

When is browsing more useful than querying to help a user find relevant information?

Works well when the user wants to explore information, doesn’t know what keywords to use, or can’t conveniently enter a query

Why is a text retrieval task defined as a ranking task?

Document selection(absolute relevance), Document ranking(relative relevance)
Problems of Document Selection
•The classifier is unlikely accurate
–“Over-constrained” query  no relevant documents to return
–“Under-constrained” query  over delivery
–Hard to find the right position between these two extremes
•Even if it is accurate, all relevant documents are not equally relevant (relevance is a matter of degree!)
–Prioritization is needed
•Thus, ranking is generally preferred

What is a retrieval model?

formalization of relevance (give a computational definition of relevance)
•Similarity-based models: f(q,d) = similarity(q,d)
–Vector space model
•Probabilistic models: f(d,q) = p(R=1|d,q), where R {0,1}
–Classic probabilistic model $\rightarrow$ BM25
–Language model $\rightarrow$ Query Likelihood
–Divergence-from-randomness model $\rightarrow$ PL2
•Probabilistic inference model: f(q,d) = p(dq)
•Axiomatic model: f(q,d) must satisfy a set of constraints
•These different models tend to result in similar ranking functions involving similar variables

What are the two assumptions made by the Probability Ranking Principle?

–The utility of a document (to a user) is independent of the utility of any other document
–A user would browse the results sequentially

What is the Vector Space Retrieval Model? How does it work?

Similarity-based models
- Represent a doc/query by a term vector
  - Term: basic concept. e.g.,word or phrase
  - Each term defines one dimension
  - N terms define an N-dimensional space
  - Query vector: q=($x_1$, …$x_N$), $x_i\in R$ is query term weight
  - Doc vector: d=($y_1$, …$y_N$), $y_j\in R$ is doc term weight
- relevance(q,d) $\propto$ similarity(q,d) =f(q,d)

How do we define the dimensions of the Vector Space Model? What does “bag of words” representation mean?

- Represent a doc/query by a term vector
- Term: basic concept. e.g.,word or phrase
- Each term defines one dimension
- N terms define an N-dimensional space
In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.

What does the retrieval function intuitively capture when we instantiate a vector space model with bag of words representation and bit representation for documents and queries?

word presence/absence

Additional Readings and Resources

The following readings are optional:

N. J. Belkin and W. B. Croft. 1992. Information filtering and information retrieval: Two sides of the same coin? Commun. ACM 35, 12 (Dec. 1992), 29-38.
C. Zhai and S. Massung. Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining, ACM Book Series, Morgan & Claypool Publishers, 2016. Chapters 1-6.

Key Phrases and Concepts

Keep your eyes open for the following key terms or phrases as you complete the readings and interact with the lectures. These topics will help you better understand the content in this module.

Part of speech tagging, syntactic analysis, semantic analysis, and ambiguity
“Bag of words” representation
Push, pull, querying, browsing
Probability ranking principle
Relevance
Vector space model
Dot product
Bag of words representation
Bit vector representation

Tips for Success

To do well this week, I recommend that you do the following:

Review the video lectures a number of times to gain a solid understanding of the key questions and concepts introduced this week.
When possible, provide tips and suggestions to your peers in this class. As a learning community, we can help each other learn and grow. One way of doing this is by helping to address the questions that your peers pose. By engaging with each other, we’ll all learn better.
It’s always a good idea to refer to the video lectures and chapter readings we’ve read during this week and reference them in your responses. When appropriate, critique the information presented.
Take notes while you read the materials and watch the lectures for this week. By taking notes, you are interacting with the material and will find that it is easier to remember and to understand. With your notes, you’ll also find that it’s easier to complete your assignments. So, go ahead, do yourself a favor; take some notes!

Getting and Giving Help

You can get/give help via the following means:

Use the Learner Help Center to find information regarding specific technical problems. For example, technical problems would include error messages, difficulty submitting assignments, or problems with video playback. If you cannot find an answer in the documentation, you can also report your problem to the Coursera staff by clicking on the Contact Us! link available on each topic’s page within the Learner Help Center.
Use the Content Issues forum to report errors in lecture video content, assignment questions and answers, assignment grading, text and links on course pages, or the content of other course materials. University of Illinois staff and Community Mentors will monitor this forum and respond to issues.

Week 1 Lessons

Lesson 1.1: Natural Language Content Analysis 21 min

What is Natural Language Processing (NLP)?
State of the Art in NLP
NLP for Text Retrieval

https://www.coursera.org/learn/text-retrieval/lecture/rLpwp/lesson-1-1-natural-language-content-analysis

Lesson 1.2: Text Access 9 min

Modes:

Push
Pull
- Querying
- Browsing

https://www.coursera.org/learn/text-retrieval/lecture/OvxTu/lesson-1-2-text-access

Lesson 1.3: Text Retrieval Problem 26 min

What is Text Retrieval?
Text Retrieval vs. Database Retrieval
Document Selection vs. Document Ranking

https://www.coursera.org/learn/text-retrieval/lecture/CXoWB/lesson-1-3-text-retrieval-problem

Lesson 1.4: Overview of Text Retrieval Methods 10 min

Term Frequency(TF) and Document Frequency(DF)

https://www.coursera.org/learn/text-retrieval/lecture/gxXq6/lesson-1-4-overview-of-text-retrieval-methods

Lesson 1.5: Vector Space Model - Basic Idea 9 min

Represent a doc/query by a term vector
- Term: basic concept. e.g.,word or phrase
- Each term defines one dimension
- N terms define an N-dimensional space
- Query vector: q=($x_1$, …$x_N$), $x_i\in R$ is query term weight
- Doc vector: d=($y_1$, …$y_N$), $y_j\in R$ is doc term weight

https://www.coursera.org/learn/text-retrieval/lecture/o8WNd/lesson-1-5-vector-space-model-basic-idea

Lesson 1.6: Vector Space Retrieval Model - Simplest Instantiation 17 min

Dimension: word
Vector=0-1 bit vector(word presence/absence)
Similarity=dot product
f(q,d) = number of distinct query words mached in d

https://www.coursera.org/learn/text-retrieval/lecture/dM6kh/lesson-1-6-vector-space-retrieval-model-simplest-instantiation

Week 1 Activities

Practice Quiz: Week 1 Practice Quiz 10 questions

PRACTICE QUIZ
Week 1 Practice Quiz
10 questions
To Pass 70% or higher
Deadline June 11, 11:59 PM PDT

1 point
1.Consider the instantiation of the vector space model where documents and queries are represented as term frequency vectors. Assume we have the following query and two documents:
Q = “future of online education”
D1 = “Coursera is shaping the future of online education; online education is affordable.”
D2 = “In the future, online education will dominate.”
Let V(X) = [c1 c2 c3 c4] represent a part of the term frequency vector for document or query X, where c1, c2, c3, and c4 are the term weights corresponding to “future,” “of,” “online,” and “education,” respectively.
Which of the following is true?

V(Q) = [1 1 1 1] V(D1) = [1 1 2 2] V(D2) = [1 1 1 1]

V(Q) = [1 1 1 1] V(D1) = [1 1 2 2] V(D2) = [1 0 1 1]

V(Q) = [1 1 1 1] V(D1) = [1 1 1 1] V(D2) = [1 0 1 1]

1 point
2.Consider the same scenario as in Question 1, with the dot product as the similarity measure. Which of the following is true?

Sim(Q,D1) = 4 Sim(Q,D2) = 3

Sim(Q,D1) = 6 Sim(Q,D2) = 4

Sim(Q,D1) = 6 Sim(Q,D2) = 3

1 point
3.Which is NOT the reason that ranking is preferred over document selection?

Not all relevant documents are equally relevant.

The boundary between relevant vs. not relevant documents is hard to define.

Users prefer to browse results in a sequential manner.

Ranking is computationally easier than selection.

1 point
4.Which of the following application scenarios in text retrieval relies LESS on NLP?

Compare homework submissions from students to check if plagiarism occurs

Use query in English to retrieve documents in French

1 point
5.What information about text is lost using “bag of words” representation?

Number of unique words in document

Word ordering

Word occurrence counts

Phrases formed by multiple words

Document length

1 point
6.Select the following applications that provide “push” information access.

Google Maps (maps.google.com)

Netflix (netflix.com)

Bing (bing.com)

Spotify mobile (iOS/Android app)

1 point
7.Which of the following factors make text retrieval more difficult than database retrieval?

Unstructured/free text query and data

Ambiguity in text

Subjective judgement of relevance and empirical evaluation involving humans

1 point
8.Stop words, such as “the,” “is,” “at,” “which,” and “on” can be identified with TF-IDF:

Low TF, high IDF

High TF, high IDF

High TF, low IDF

Low TF, low IDF

1 point
9.How many syntactic structures can you identified in sentence “A man saw a boy with a telescope”?

3

1 point
10.In VSM model, which of the following similarity/distance measure would be affected by document length?

L2 distance: $||v_1−v_2||_2$

Cosine similarity: $\cos(v_1,v_2)$

Quiz: Week 1 Quiz10 questions

QUIZ
Week 1 Quiz
10 questions
To Pass70% or higher
Attempts3 every 8 hours
Deadline
June 11, 11:59 PM PDT

1 point
1.The sentence “A man saw a boy with a telescope” is syntactically ambiguous and has two distinct syntactic structures.

True

False

1 point
2.Which of the following is false?

Recommender systems are based on the text push mode.

Search engines rely on the text push mode.

Querying and browsing are both examples of the text pull mode.

Browsing is suitable when the user doesn’t know what keywords to use.

1 point
3.Consider the instantiation of the vector space model where documents and queries are represented as bit vectors. Assume we have the following query and two documents:

Q = “healthy diet plans”

D1 = “healthy plans for weight loss. Check out other healthy plans”

D2 = “the presidential candidate plans to change the educational system.”

Let V(X) = [b1 b2 b3] represent a part of the bit vector for document or query X, where b1, b2, and b3 are the bits corresponding to “healthy,” “diet,” and “plans,” respectively.

Which of the following is true?

V(Q) = [1 1 1] V(D1) = [1 0 1] V(D2) = [0 0 1]

V(Q) = [1 1 1] V(D1) = [1 1 1] V(D2) = [0 0 1]

V(Q) = [1 1 1] V(D1) = [1 1 1] V(D2) = [0 0 0]

V(Q) = [1 1 1] V(D1) = [2 0 2] V(D2) = [0 0 1]

1 point
4.Consider the same scenario as in Question 3, with dot product as the similarity measure. Which of the following is true?

Sim(Q,D1) = 3 Sim(Q,D2) = 0

Sim(Q,D1) = 3 Sim(Q,D2) = 1

Sim(Q,D1) = 2 Sim(Q,D2) = 1

Sim(Q,D1) = 4 Sim(Q,D2) = 1

1 point
5.In the “simplest” VSM instantiation, if instead of using 0-1 bit vectors but we use the word count instead, when we concatenate each document by itself, will the ranking list still remain the same?

False

True

1 point
6.In Text Retrieval problem for N distinct documents, select statements below that are correct?

If use document selection, the number of outcomes is $2^N$

The numbers of outcome for document ranking and selection are the same

If use document ranking, the number of outcomes is N!

Document selection is preferred as there is no need to determine document absolute relevance

1 point
7.Suppose we compute the term vector for a baseball sports news article in a collection of general news articles using TF weighting only. Which of the following do you expect to have the highest weight?

baseball

the

computer

1 point
8.Assume the same scenario as in Question 7, but with TF-IDF weighting. Which of the following words do you expect to have the highest weight in this case?

the

baseball

computer

1 point
9.Consider the following retrieval formula:

Where c(w, D) is the count of word w in document D,

dl is the document length,

avdl is the average document length of the collection,

N is the total number of documents in the collection,

and df (w) is the number of documents containing word w.

In view of TF, IDF weighting, and document length normalization, which part is missing or does not work appropriately?

IDF

Document length normalization

1 point
10.In VSM model, which of the following will be a better way to measure similarity/distance?

Cosine similarity: $\cos(v_1,v_2)$

L2 distance: $||v_1−v_2||_2$

Week 2

In this week’s lessons, you will learn how the vector space model works in detail, the major heuristics used in designing a retrieval function for ranking documents with respect to a query, and how to implement an information retrieval system (i.e., a search engine), including how to build an inverted index and how to score documents quickly for a query.

Week 2 Information

Week 2 Overview 10 min

https://www.coursera.org/learn/text-retrieval/supplement/okACu/week-2-overview

During this week’s lessons, you will learn how the vector space model works in detail, the major heuristics used in designing a retrieval function for ranking documents with respect to a query, and how to implement an information retrieval system (i.e., a search engine), including how to build an inverted index and how to score documents quickly for a query.

Time

This module should take approximately 3 hours of dedicated time to complete, with its videos and assignments.

Activities

The activities for this module are listed below (with assignments in bold):

Activity	Estimated Time Required
Week 2 Video Lectures	2 hours
Week 2 Graded Quiz	1 hour

Goals and Objectives

After you actively engage in the learning experiences in this module, you should be able to:

Explain what TF-IDF weighting is and why TF transformation and document length normalization are necessary for the design of an effective ranking function.
Explain what an inverted index is and how to construct it for a large set of text documents that do not fit into the memory.
Explain how variable-length encoding can be used to compress integers and how unary coding and gamma-coding work.
Explain how scoring of documents in response to a query can be done quickly by using an inverted index.
Explain Zipf’s law.

Guiding Questions

Develop your answers to the following guiding questions while completing the readings and working on assignments throughout the week.

What are some different ways to place a document as a vector in the vector space?

Improved Vector Placement: Term Frequency Vector
xi = count of word Wi in query; yi = count of word Wi in doc
Further Improvement of Vector Placement: Adding Inverse Document Frequency (IDF)
xi = count of word Wi in query; yi = c(Wi ,d) *IDF(Wi)

What is term frequency (TF)?

count of word Wi in query or doc

What is TF transformation?

c(w,d) $\rightarrow$ TF(w,d); it is used to
–capture the intuition of “diminishing return” from higher TF
–avoid dominance by one single term over all others
such as BM25 Transformation
–has an upper bound
–is robust and effective

What is document frequency (DF)?

total number of docs containing word W (Doc Frequency)

What is inverse document frequency (IDF)?

IDF Weighting: Penalizing Popular Terms
IDF(W) = log[(M+1)/k]
M: total number of docs in collection
k: total number of docs containing word W (Doc Frequency)

What is TF-IDF weighting?

$f(q,d) = \sum_{i=1}^{N}x_i y_i = \sum_{w\in q \cap d}^{}c(w,q) c(w,d)log\frac{M+1}{df(w)}$

Why do we need to penalize long documents in text retrieval?

•Penalize a long doc with a doc length normalizer
–Long doc has a better chance to match any query
–Need to avoid over-penalization
•A document is long because
–it uses more words $\rightarrow$ more penalization
–it has more contents $\rightarrow$ less penalization

What is pivoted document length normalization?

average doc length as “pivot”
–Normalizer = 1 if |d| =average doc length (avdl)
$$normalizer = 1 - b + b\frac{|d|}{avdl}, b \in [0,1]$$

What are the main ideas behind the retrieval function BM25?
TF-IDF weighting, TF transformation, Pivoted length normalizer
Address the problem of over penalization of long documents by BM25
What is the typical architecture of a text retrieval system?

Offline: doc $\rightarrow$ tokenizer $\rightarrow$ index $\rightarrow$ indexer
Online: query
Offline & Online: judgments $\rightarrow$ feedback
indexer + query + feedback $\rightarrow$ scorer $\rightarrow$ result

What is an inverted index?

Indexing = Convert documents to data structures that enable fast search (precomputing as much as we can)
Inverted index is the dominating indexing method for supporting basic search algorithms

Why is it desirable to compress an inverted index?

•The main difficulty is to build a huge index with limited memory
•Memory-based methods: not usable for large collections

How can we create an inverted index when the collection of documents does
not fit into the memory?

•TF compression: Binary code, unary code, $\gamma$-code, $\delta$-code,
•Doc ID compression: –“d-gap” (store difference): d1, d2-d1, d3-d2,…

How can we leverage an inverted index to score documents quickly?

•Caching (e.g., query results, list of inverted index)
•Keep only the most promising accumulators
•Scaling up to the Web-scale? (need parallel processing)

Additional Readings and Resources

The following readings are optional:

C. Zhai and S. Massung. Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining, ACM Book Series, Morgan & Claypool Publishers, 2016. Chapter 6 - Section 6.3, and Chapter 8.
Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images, Second Edition. Morgan Kaufmann, 1999.

Key Phrases and Concepts

Keep your eyes open for the following key terms or phrases as you complete the readings and interact with the lectures. These topics will help you better understand the content in this module.

Term frequency (TF)
Document frequency (DF) and inverse document frequency (IDF)
TF transformation
Pivoted length normalization
BM25
Inverted index and postings
Binary coding, unary coding, gamma-coding, and d-gap
Zipf’s law

Week 2 Lessons

Start Lesson

Lesson 2.1: Vector Space Model - Improved Instantiation 16 min

Improved VSM
– Dimension = word
– Vector = TF-IDF weight vector
– Similarity = dot product
– Working better than the simplest VSM
– Still having problems

TF = term / length of doc
DF = docs containing term / entire collection
IDF = log[(entire collection +1) / number of docs containing term]
https://www.coursera.org/learn/text-retrieval/lecture/7jqJI/lesson-2-1-vector-space-model-improved-instantiation
course error: Vector = IDF weight vector

Lesson 2.2: TF Transformation 9 min

Sublinear TF Transformation is needed to
– capture the intuition of “diminishing return” from higher TF
– avoid dominance by one single term over all others
BM25 Transformation
– has an upper bound
– is robust and effective
Ranking function with BM25 TF (k >=0)

$f(q,d) = \sum_{i=1}^{N}x_i y_i = \sum_{w\in q \cap d}^{}c(w,q) \frac{(k+1)c(w,d)}{c(w,d)+k}log\frac{M+1}{df(w)}$
https://www.coursera.org/learn/text-retrieval/lecture/W0NZe/lesson-2-2-tf-transformation

Lesson 2.3: Doc Length Normalization 18 min

Relevance(q,d) = similarity(q,d)

Query and documents are represented as vectors
Heuristic design of ranking function
Major term weighting heuristics
– TF weighting and transformation
– IDF weighting
– Document length normalization
BM25 and Pivoted normalization seem to be most
effective

https://www.coursera.org/learn/text-retrieval/lecture/RnXhr/lesson-2-3-doc-length-normalization

Lesson 2.4: Implementation of TR Systems 21 min

https://www.coursera.org/learn/text-retrieval/lecture/2Cbq9/lesson-2-4-implementation-of-tr-systems

Lesson 2.5: System Implementation - Inverted Index Construction 18 min

https://www.coursera.org/learn/text-retrieval/lecture/PgzsP/lesson-2-5-system-implementation-inverted-index-construction

Lesson 2.6: System Implementation - Fast Search 17 min

Some Text Retrieval Toolkits

Lucene: http://lucene.apache.org/
Lemur/Indri: http://www.lemurproject.org/
Terrier: http://terrier.org/
MeTA: http://meta-toolkit.github.io/meta/
More can be found at http://timan.cs.uiuc.edu/resources
Inverted index and its construction
– Preprocess data as much as we can
– Compression when appropriate
Fast search using inverted index
– Exploit inverted index to accumulate scores for documents matching
a query term
– Exploit Zipf’s law to avoid touching many documents not matching
any query term
– Can support a wide range of ranking algorithms
Great potential for further scaling up using distributed file
system, parallel processing, and caching

https://www.coursera.org/learn/text-retrieval/lecture/QKK7y/lesson-2-6-system-implementation-fast-search

Week 2 Activities

Practice Quiz: Week 2 Practice Quiz 10 questions

Week 2 Practice Quiz
10 questions
To Pass70% or higher
Deadline
June 18, 11:59 PM PDT

1 point
1.Let w1, w2, and w3 represent three words in the dictionary of an inverted index. Suppose we have the following document frequency distribution:

Word	Document Frequency
w1	1
w2	5
w3	10

Assume that each posting entry of document ID and term frequency takes exactly the same disk space. Which word’s postings list will occupy the largest disk space?

w3

1 point
2.Assume we have the same scenario as in Question 1. If we enter a query Q= “w1 w2 w3” then the maximum possible number of accumulators needed to score all the matching documents is:

5

1 point
3.Assume that the d-gap between two documents is equal to 9. If you want to compress this d-gap with a gamma code, what will be the binary representation of the code?

1110010

1110011

1110001

1110000

1 point
4.Why is TF transformation needed?

To capture the intuition of “diminishing return” from higher TF

So that computation is more efficient

1 point
5.What is the upperbound for BM25 transformation?

k+1

k-1

k

1 point
6.Do we always want to penalize a long document?

Yes

No

1 point
7.Which is true about pivoted length normalization?

It always rewards.

It has both a penalization and reward effect.

It always penalizes.

1 point
8.Is word segmentation on Chinese easier than English?

Yes

1 point
9.What is NOT the advantage for using inverted index?

Inverted index can map words of the same meaning into one slot.

It can search for documents that contains both “A” and “B” efficiently.

It is more efficient than sequentially scanning docs.

1 point
10.What does Zipf’s law tell you?

Words are evenly distributed.

There are many words that have a small probability.

There are only a few words that have a small probability.

Quiz: Week 2 Quiz 12 questions

Due June 18, 11:59 PM PDT

QUIZ
Week 2 Quiz
12 questions
To Pass70% or higher
Attempts3 every 8 hours
Deadline
June 18, 11:59 PM PDT

1 point
1.Let w1, w2, and w3 represent three words in the dictionary of an inverted index. Suppose we have the following document frequency distribution:

Word	Document Frequency
w1	1000
w2	100
w3	10

Assume that each posting entry of document ID and term frequency takes exactly the same disk space. Which word, if removed from the inverted index, will save the most disk space?

We cannot tell from the given information.

w2

1 point
2.Assume we have the same scenario as in Question 1. If we enter the query Q= “w1 w2” then the minimum possible number of accumulators needed to score all the matching documents is:

100

1100

1000

10

1 point
3.The gamma code for the term frequency of a certain document is 1110010. What is the term frequency of the document?

9

1 point
4.When using an inverted index for scoring documents for queries, a shorter query always uses fewer score accumulators than a longer query.

False

True

1 point
5.What is the advantage of tokenization (normalize and stemming) before index?

Extracts words as lexical units from strings of text

Improves performance by mapping words with similar meanings into the same indexing term

Reduces the number of terms (size of vocabulary)

1 point
6.What can’t an inverted index alone do for fast search?

Search document contains “A” and “B”

Search document contains “A” or “B”

Retrieve documents that are relevant to the query

1 point
7.If Zipf’s law does not hold, will an inverted index be much faster or slower?

Faster

Slower

1 point
8.In BM25, the TF after transformation has upper bound

k +1

1

1 point
9.Which of the following are weighing heuristics for the vector space model?

TF weighting and transformation

Document length normalization

IDF weighting

1 point
10.Which of the following integer compression has equal-length coding?

Unary

Binary

γ-code

1 point
11.Consider the following retrieval formula:

Where c(w, D) is the count of word w in document D,

dl is the document length,

avdl is the average document length of the collection,

N is the total number of documents in the collection,

and df (w) is the number of documents containing word w.

In view of TF, IDF weighting, and document length normalization, which part is missing or does not work appropriately?

IDF

Document length normalization

1 point
12.Suppose we compute the term vector for a baseball sports news article in a collection of general news articles using TF-IDF weighting. Which of the following words do you expect to have the highest weight in this case?

baseball

computer

the

Week 3

ChengXiang Zhai

In this week’s lessons, you will learn how to evaluate an information retrieval system (a search engine), including the basic measures for evaluating a set of retrieved results and the major measures for evaluating a ranked list, including the average precision (AP) and the normalized discounted cumulative gain (nDCG), and practical issues in evaluation, including statistical significance testing and pooling.

Week 3 Information

Week 3 Overview 10 min

https://www.coursera.org/learn/text-retrieval/supplement/0wyle/week-3-overview

Time

This module should take approximately 3-6 hours of dedicated time to complete, with its videos and assignments.

Activities

The activities for this module are listed below (with assignments in bold):

Activity	Estimated Time Required
Week 3 Video Lectures	2 hours
Week 3 Graded Quiz	1 hour
Optional Programming Assignment 1	3 hours

Goals and Objectives

After you actively engage in the learning experiences in this module, you should be able to:

Explain the Cranfield evaluation methodology and how it works for evaluating a text retrieval system.
Explain how to evaluate a set of retrieved documents and how to compute precision, recall, and F1.
Explain how to evaluate a ranked list of documents.
Explain how to compute and plot a precision-recall curve.
Explain how to compute average precision and mean average precision (MAP).
Explain how to evaluate a ranked list with multi-level relevance judgments.
Explain how to compute normalized discounted cumulative gain.
Explain why it is important to perform statistical significance tests.

Guiding Questions

Develop your answers to the following guiding questions while completing the readings and working on assignments throughout the week.

Why is evaluation so critical for research and application development in text retrieval?

•Reason 1: Assess the actual utility of a TR system
–Measures should reflect the utility to users in a real application
–Usually done through user studies (interactive IR evaluation)
•Reason 2: Compare different systems and methods
–Measures only need to be correlated with the utility to actual users, thus don’t have to accurately reflect the exact utility to users
–Usually done through test collections (test set IR evaluation)

How does the Cranfield evaluation methodology work?

•Idea: Build reusable test collections & define measures
–A sample collection of documents (simulate real document collection)
–A sample set of queries/topics (simulate user queries)
–Relevance judgments (ideally made by users who formulated the queries)  Ideal ranked list
–Measures to quantify how well a system’s result matches the ideal ranked list
•A test collection can then be reused many times to compare different systems

How do we evaluate a set of retrieved documents?

Precision and Recall;Combine Precision and Recall: F-Measure

How do you compute precision, recall, and F1?

$precision = \frac{a}{a+c}$ $recall = \frac{a}{a+b}$ $F_{\beta} = \frac{(\beta^2+1)P*R}{\beta^2P+R}$
a: Relevant Retrieved, b: Relevant Rejected, c: Irrelevant Retrieved

How do we evaluate a ranked list of search results?

Precision-Recall curve
•Average Precision:
–The average of precision at every cutoff where a new relevant document is retrieved
–Normalizer = the total # of relevant docs in collection
–Sensitive to the rank of each relevant document
•Mean Average Precisions (MAP)
–MAP = arithmetic mean of average precision over a set of queries
–gMAP = geometric mean of average precision over a set of queries

How do you compute average precision? How do you compute mean average precision (MAP) and geometric mean average precision (gMAP)?

–The average of precision at every cutoff where a new relevant document is retrieved
–Normalizer = the total # of relevant docs in collection
–Sensitive to the rank of each relevant document
arithmetic mean of average precision over a set of queries
geometric mean of average precision over a set of queries

What is mean reciprocal rank?

•When there’s only one relevant document in the collection (e.g., known item search)
–Average Precision = Reciprocal Rank = 1/r, where r is the rank position of the single relevant doc
–Mean Average Precision $\rightarrow$ Mean Reciprocal Rank

Why is MAP more appropriate than precision at k documents when comparing two retrieval methods?

•Precision-Recall curve characterizes the overall accuracy of a ranked list
•The actual utility of a ranked list depends on how many top-ranked results a user would examine
•Average Precision is the standard measure for comparing two ranking methods
–Combines precision and recall
–Sensitive to the rank of every relevant document

Why is precision at k documents more meaningful than average precision from a user’s perspective?

precision at k documents is more intuitive

How can we evaluate a ranked list of search results using multi-level relevance judgments?

Normalized Discounted Cumulative Gain (nDCG)

How do you compute normalized discounted cumulative gain (nDCG)?

Relevance level: r=1 (non-relevant) , 2 (marginally relevant), 3 (very relevant)

doc	gain	cumulative gain	Discounted Cumulative Gain
D1	3	3	3
D2	2	3+2	3+2/log 2
D3	1	3+2+1	3+2/log 2+1/log 3
D4	1	3+2+1+1
D5	3
D6	1
D7	1
D8	2
D9	1

DCG@10 = 3+2/log 2+1/log 3 +…+ 1/log 10
IdealDCG@10 = 3+3/log 2+3/log 3 +…+ 3/log 9+ 2/log 10
Assume: there are 9 documents rated “3” in total in the collection
$Normalized DCG @10= \frac{DCG@10}{idealDCG@10}$

Why is normalization necessary in nDCG? Does MAP need a similar normalization? Why is it important to perform statistical significance tests when we compare the retrieval accuracies of two search engine systems?

different topics same scale
No
How sure can you be that an observed difference doesn’t simply result from the particular queries you chose?

Additional Readings and Resources

Mark Sanderson. Test collection based evaluation of information retrieval systems. Foundations and Trends in Information Retrieval 4, 4 (2010), 247-375.
C. Zhai and S. Massung. Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining, ACM Book Series, Morgan & Claypool Publishers, 2016. Chapter 9

Key Phrases and Concepts

Keep your eyes open for the following key terms or phrases as you complete the readings and interact with the lectures. These topics will help you better understand the content in this module.

Cranfield evaluation methodology
Precision and recall
Average precision, mean average precision (MAP), and geometric mean average precision (gMAP)
Reciprocal rank and mean reciprocal rank
F-measure
Normalized discounted cumulative Gain (nDCG)
Statistical significance test

Week 3 Lessons

Lesson 3.1: Evaluation of TR Systems 10 min

https://www.coursera.org/learn/text-retrieval/lecture/YSvkh/lesson-3-1-evaluation-of-tr-systems

The Cranfield Evaluation Methodology
– A sample collection of documents (simulate real document collection)
– A sample set of queries/topics (simulate user queries)
– Relevance judgments (ideally made by users who formulated the queries)  Ideal
ranked list
– Measures to quantify how well a system’s result matches the ideal ranked list

Lesson 3.2: Evaluation of TR Systems - Basic Measures 12 min

https://www.coursera.org/learn/text-retrieval/lecture/VMh3Z/lesson-3-2-evaluation-of-tr-systems-basic-measures

Precision: are the retrieved results all relevant?
• Recall: have all the relevant documents been retrieved?
• F measure combines Precision and Recall
• Tradeoff between Precision and Recall depends on the
user’s search task

Lesson 3.3: Evaluation of TR Systems - Evaluating Ranked Lists - Part 1 15 min

https://www.coursera.org/learn/text-retrieval/lecture/rU7LT/lesson-3-3-evaluation-of-tr-systems-evaluating-ranked-lists-part-1

PR curves
average precision

Lesson 3.4: Evaluation of TR Systems - Evaluating Ranked Lists - Part 2 10 min

https://www.coursera.org/learn/text-retrieval/lecture/8Q2Tw/lesson-3-4-evaluation-of-tr-systems-evaluating-ranked-lists-part-2

Mean Average Precisions (MAP)
Mean Reciprocal Rank
Average Precision is the standard measure for comparing
two ranking methods
– Combines precision and recall
– Sensitive to the rank of every relevant document

Lesson 3.5: Evaluation of TR Systems - Multi-Level Judgements 10 min

https://www.coursera.org/learn/text-retrieval/lecture/uGa00/lesson-3-5-evaluation-of-tr-systems-multi-level-judgements

Main idea of nDCG@k documents
– Measure the total utility of the top k documents to a
user
– Utility of a lowly ranked document is discounted
– Normalized to ensure comparability across queries

Lesson 3.6: Evaluation of TR Systems - Practical Issues 15 min

https://www.coursera.org/learn/text-retrieval/lecture/thRNy/lesson-3-6-evaluation-of-tr-systems-practical-issues

Week 3 Activities

Practice Quiz: Week 3 Practice Quiz 10 questions

PRACTICE QUIZ
Week 3 Practice Quiz
10 questions
To Pass70% or higher
Deadline
June 25, 11:59 PM PDT

1 point
1.Suppose a query has a total of 5 relevant documents in a collection of 100 documents. System A and System B have each retrieved 10 documents, and the relevance status of the ranked lists is shown below:

System A: [+ + - - - - - - - -]

System B: [- + - - + - - - - +]

where the leftmost entry corresponds to the highest ranked document, and the rightmost entry corresponds to the lowest ranked document. A “+” indicates a relevant document and a “-” corresponds to a non-relevant one. For example, the top ranked document retrieved by System A is relevant, whereas the top ranked document retrieved by B is non-relevant.

What is the precision at 10 documents of both systems?

P(A) = 2/10 P(B) = 3/10

P(A) = 2/100 P(B) = 3/100

P(A) = 2/5 P(B) = 3/5

P(A) = 8/100 P(B) = 7/100

1 point
2.Assume the same scenario as in Question 1. What is the recall of both systems?

R(A) = 2/5 R(B) = 3/5

R(A) = 2/100 R(B) = 3/100

R(A) = 2/10 R(B) = 3/10

R(A) = 8/100 R(B) = 7/100

1 point
3.Assume the same scenario as in Question 1. What is the average precision of both systems?
(1/1 + 2/2 + 0+0+0)/5 = 2/5
(1/2+2/5+3/10+0+0)/5=6/25

AP(A) = 2/5 AP(B) = 6/25

AP(A) = 2/10 AP(B) = 3/25

AP(A) = 2/100 AP(B) = 3/250

AP(A) = 3/10 AP(B) = 9/20

1 point
4.Assume you have two retrieval systems X and Y. If X has a higher MAP (mean average precision), can Y have a higher gMAP (geometric mean average precision)?
MAP_x > MAP_y
MAP_x >= gMAP_x

Yes

No

1 point
5.If system A has higher precision at k document than system B for any number of k, does it mean A also has higher recall than B at any position?

Yes

No

1 point
6.F-measure contains a parameter that weighs between precision and recall. For an automatic system that filters tweets in the search for possible communication between terrorists about attack plans on US soil, when evaluating the system’s performance with F-measure, should we use a higher parameter or lower?

Lower

Higher

1 point
7.What is nDCG capable of but not DCG?

Compare two systems performed on a set of queries

Compare two systems performed on one single query

Work with relevance judgement that is multi-level

Work with relevance judgement that is binary-level (relevant vs. not relevant)

1 point
8.Why is pooling useful?

So that we don’t need to judge every document

So that we don’t need humans to do judgement

So that all documents can be judged efficiently

1 point
9.Which of the following is not correct?

Precision@10docs is easy for users to interpret.

MAP and nDCG are good for comparing ranking algorithms.

DCG is better than nDCG as its value is within [0, 1].

1 point
10.If in PR (precision, recall) curves, curve A is above B for all recall, what can you say?

A is better than B.

B is better than A.

There is no clear conclusion about A vs. B.

A is as good as B.

Quiz: Week 3 Quiz 10 questions

Due June 25, 11:59 PM PDT

QUIZ
Week 3 Quiz
10 questions
To Pass70% or higher
Attempts3 every 8 hours
Deadline
June 25, 11:59 PM PDT

1 point
1.Suppose a query has a total of 4 relevant documents in the collection. System A and System B have each retrieved 10 documents, and the relevance status of the ranked lists is shown below:

System A: [- + - - - - - - - -]

System B: [+ + - - - - - - - -]

where the leftmost entry corresponds to the highest ranked document, and the rightmost entry corresponds to the lowest ranked document. A “+” indicates a relevant document and a “-” corresponds to a non-relevant one. For example, the top ranked document retrieved by System A is non-relevant, whereas the top ranked document retrieved by B is relevant.

What is the precision at 10 documents of both systems?

P(A) = 1/40 P(B)= 2/40

P(A) = 1/10 P(B)= 2/10

P(A) = 1/4 P(B)= 2/4

P(A) = 9/10 P(B)= 8/10

1 point
2.Assume the same scenario as in Question 1. What is the recall of both systems?

R(A) = 1/40 R(B)= 2/40

R(A) = 1/10 R(B)= 2/10

R(A) = 1/4 R(B)= 2/4

R(A) = 9/10 R(B)= 8/10

1 point
3.Assume the same scenario as in Question 1. What is the average precision of both systems?
(1/2)/4
(1/1 + 2/2 )/4

AP(A) = 7/20 AP(B) = 7/10

AP(A) = 1/10 AP(B) = 1/5

AP(A) = 1/8 AP(B) = 1/2

AP(A) = 1/20 AP(B) = 1/5

1 point
4.Assume you have two retrieval systems X and Y. For a specific query, system X has a higher precision at 10 documents compared to Y. Can system Y have a higher average precision on the same query?

Yes

No

1 point
5.Can a retrieval system have an F1 score of 0.75 and a precision of 0.5?

Yes

1 point
6.For any ranked list of search results, precision at 10 documents is always higher than precision at 20 documents.

False

True

1 point
7.What can you say about the precision-recall (PR) curve?

It is always monotonically decreasing.

The ideal system should have the PR curve as a horizontal line.

It is always monotonically increasing.

1 point
8.Which is correct about average precision?

It does not show the difference between ranks of relevant documents.

It combines precision and recall.

1 point
9.Which of the following is NOT true about Cranfield evaluation methodology?

It does not involve humans to make relevance judgments.

It simulates user queries.

It simulates real document collections.

1 point
10.Which of following is wrong about nDCG@k?

It discounts only top ranked documents.

It has a range between 0 and 1.

It can be used to compare across queries.

Optional Honors Content

Honors Track Programming Assignment

Programming Assignments Overview 10 min

Overview

For the programming assignments, you are going to use MeTA, a C++ data sciences toolkit that supports a variety of tasks such as text retrieval, natural language processing, and machine learning. The main focus in this course will be on how to use and extend the text retrieval methods implemented in MeTA. The assignments assume a very basic understanding of C++ and elementary data structures. That being said, even if your experience is in languages other than C++, you are highly encouraged to go through the assignments, as the focus will be on exploring text retrieval concepts, not on testing your knowledge of the syntax.

This assignment will explore the main text processing techniques implemented in MeTA and then guide you through implementing a major component of a search engine. It will also involve a competition where you can freely create and optimize your own search engine. Although the assignment starts in Week 3, we encourage you to install the toolkit as soon as possible.

If you have questions about the installation process, use the Programming Assignments Forum. This is a great place to ask questions and also help your fellow classmates.

Software Installation

There are two software prerequisites you must install in order to complete the assignments:

MeTA
Python 2.7 (used only for uploading your assignment, not for coding)

To install the MeTA tool, please follow this page. Currently, most major OS including OS X, Ubuntu (and other Linux distributions), and Windows are supported.

windows setup guide

follow this website

something to note:

git clone https://github.com/meta-toolkit/meta.git or quick download zip files to C:\msys64\home\SSQ\meta

SSQ@SSQ-PC MINGW64 ~
$ git clone https://github.com/meta-toolkit/meta.git
正克隆到 'meta'...
remote: Counting objects: 33329, done.
remote: Total 33329 (delta 0), reused 0 (delta 0), pack-reused 33329
接收对象中: 100% (33329/33329), 30.84 MiB | 23.00 KiB/s, 完成.
处理 delta 中: 100% (25323/25323), 完成.

'deps/bandit'（https://github.com/joakimkarlsson/bandit.git）
'deps/cpptoml'（https://github.com/skystrife/cpptoml.git）
'deps/findicu'（https://github.com/julp/FindICU.cmake.git）
'deps/libsvm-modules'（https://github.com/meta-toolkit/meta-libsvm.git）
 'deps/meta-cmake'（https://github.com/meta-toolkit/meta-cmake.git）
 'deps/meta-stlsoft'（https://github.com/meta-toolkit/meta-stlsoft.git）

or quick download zip files to C:\msys64\home\SSQ\meta\deps

after typing make and hit enter

Scanning dependencies of target meta-io
[  3%] Building CXX object src/io/CMakeFiles/meta-io.dir/filesystem.cpp.obj
[  4%] Building CXX object src/io/CMakeFiles/meta-io.dir/gzstream.cpp.obj
[  4%] Building CXX object src/io/CMakeFiles/meta-io.dir/libsvm_parser.cpp.obj
[  4%] Building CXX object src/io/CMakeFiles/meta-io.dir/mmap_file.cpp.obj
[  4%] Building C object src/io/CMakeFiles/meta-io.dir/mman-win32/mman.c.obj
C:/msys64/home/SSQ/meta/src/io/mman-win32/mman.c: In function '__map_mman_error':
C:/msys64/home/SSQ/meta/src/io/mman-win32/mman.c:12:56: warning: unused parameter 'deferr' [-Wunused-parameter]
 static int __map_mman_error(const DWORD err, const int deferr)
                                                        ^~~~~~
C:/msys64/home/SSQ/meta/src/io/mman-win32/mman.c: In function 'mmap':
C:/msys64/home/SSQ/meta/src/io/mman-win32/mman.c:58:18: warning: unused parameter 'addr' [-Wunused-parameter]
 void* mmap(void *addr, size_t len, int prot, int flags, int fildes, off_t off)
                  ^~~~
C:/msys64/home/SSQ/meta/src/io/mman-win32/mman.c: In function 'munmap':
C:/msys64/home/SSQ/meta/src/io/mman-win32/mman.c:129:31: warning: unused parameter 'len' [-Wunused-parameter]
 int munmap(void *addr, size_t len)
                               ^~~
C:/msys64/home/SSQ/meta/src/io/mman-win32/mman.c: In function 'msync':
C:/msys64/home/SSQ/meta/src/io/mman-win32/mman.c:152:39: warning: unused parameter 'flags' [-Wunused-parameter]
 int msync(void *addr, size_t len, int flags)
                                       ^~~~~
[  5%] Building CXX object src/io/CMakeFiles/meta-io.dir/xzstream.cpp.obj
[  5%] Linking CXX static library ../../lib/libmeta-io.a
[  5%] Built target meta-io
Scanning dependencies of target meta-filters

Programming Assignment: Programming Assignment 13h

Due June 25, 11:59 PM PDT

MeTA Overview

Before starting the assignment, you are highly encouraged to read MeTA’s system overview to gain a high level understanding of how MeTA operates. Throughout the assignments, if you want to know more about a certain class or function, you can use the search toolbar in MeTA’s documentation, which provides a brief explanation of the different modules.

If you have questions about the programming assignment, use the Programming Assignments Forum. This is a great place to ask questions and also help your fellow classmates.

Downloading the Assignment

In what follows, we assume that you have already installed MeTA and the directory meta is located in ~/Desktop/.

Download Assignment_1.tar.gz (below) and extract it into the parent directory of meta. For example, if meta is in ~/Desktop/, then you should extract the assignment in ~/Desktop/
In the terminal, change the directory to Assignment_1 and run the bash script Setup.sh. This can be done using the following two commands:

1 2	cd Assignment_1/ ./setup.sh

Now using MeTA as a library, to compile:

1
2
3

cd Assignment_1/build
cmake .. -G "MSYS Makefiles" -DCMAKE_BUILD_TYPE=Release
make -j8

Download Assignment_1.tar.gz:
Assignment_1.tar.gz

Warm Up!

In this section you will start by exploring some of the basic text processing techniques such as tokenization, stemming, and stopword removal. These techniques are usually applied to the corpus prior to the indexing stage. You will also perform part-of-speech tagging on a text document.

Tokenization is the process of segmenting a stream of words (such as a document) into a sequence of units called tokens. Loosely speaking, when tokenization is performed on the word level, the tokens will be the words in the document. To perform tokenization, MeTA relies on the default segmentation boundaries between words and sentences defined by the Unicode Standard (see Unicode Standard Annex #29 for more information).

Typically, after tokenization is performed, a number of text filtering techniques are applied on the tokens in order to perform more effective and efficient retrieval. These techniques include:

Stopword removal: Stopwords are frequent words that are not informative for the task at hand. For most retrieval applications, words such as “in,” “or,” “have,” and “the” are not useful for identifying the relevant documents, and thus, they are discarded before the indexing stage. This considerably helps in reducing the size of the inverted index since stopwords occur very frequently and tend to have large postings lists. MeTA uses a list of stopwords saved in /meta/data/lemur-stopwords.txt. Feel free to open the file and have a look at some of the popular stopwords.
Conversion to lower case: For most applications, converting all letters to lower case can help in boosting the retrieval performance. The intuition here is that uppercase and lowercase forms of words usually refer to the same concept and should not be treated as orthogonal dimensions. However, this conversion can lead to inaccuracies in certain situations. For example, a proper noun like “CAT” (a construction company) will have the same representation as the common noun “cat.”
Stemming: It is the process of converting words back to their original stems or roots. For example, the words “retrieve,” “retrieval,” and “retrieving” will all be mapped to the same root “retrieve.” This can prevent different forms of the same word from being treated as orthogonal concepts. MeTA uses a very well known stemmer called the Porter2 English Stemmer; see English (Porter 2) stemming algorithm for more information on how this stemmer operates.
Now you are ready to apply these concepts using MeTA!

Task 1: Stopword Removal (5 pts)

In this task you will run tokenization and stopword removal on a document. Go to Assignment_1/build/Assignment1/. You should find a text document called doc.txt. Open doc.txt and have a look at its content. The document contains the description found on the main page of this course.

In the terminal, perform the following:

cd Assignment_1/build
cmake .. -G "MSYS Makefiles" -DCMAKE_BUILD_TYPE=Release
make -j8
./analyze ../config.toml Assignment1/doc.txt --stop

After running the above, if everything is setup correctly, then you should see:

1
2
3

$ ./analyze ../config.toml Assignment1/doc.txt --stop
Running stopword removal -> 
  file saved as Assignment1/doc.stops.txt

Open the output file Assignment_1/build/Assignment1/doc.stops.txt and you should see how tokenization and stopword removal have been applied.

In the 2nd command above, you have executed a compiled program called analyze and passed to it a configuration file called config.toml, the document’s path, and the parameter “stop,” which tells the program to remove stopwords. The analyze is a demo program that supports multiple text analysis functions. Passing config.toml to programs in MeTA is a default practice as it gives the program all the required parameters to run.

Open Assignment_1/config.toml and examine it. You should see a list of configuration parameters, including the directory of the list of stopwords (which should appear in the first line). If you have a new list of stopwords, you can simply point MeTA to it (but do not change the list for this assignment).

After finishing Task 1, you should submit the output file using the submission script. In the shell, execute:

1 2	cd Assignment_1/build/Assignment1 wc < doc.stops.txt \| sed 's/^\s*//g' > doc.stops.txt.wc

Submit the file docs.stops.txt.wc to Task 1. In the My submission tab, click + Create submission. Then click on Task 1: Stopword Removal and upload your file.

Task 2: Stemming (10 pts)

Now you will perform tokenization and stemming on the same document doc.txt. The analyze program you ran in Task 1 provides several other text analysis functionalities, including stemming. For the different functions implemented in analyze, open Assignment_1/analyze.cpp and have a look. In addition, simply run ./analyze and you should see:

$ ./analyze
Usage: ./analyze config.toml file.txt [OPTION]
where [OPTION] is one or more of:        
  --stem  perform stemming on each word        
  --stop  remove stopwords        
  --stopstem      remove stopwords and perform stemming        
  --pos   annotate words with POS tags        
  --pos-replace   replace words with their POS tags        
  --parse create grammatical parse trees from file content          
  --freq-unigram  sort and count unigram words        
  --freq-bigram   sort and count bigram words        
  --freq-trigram  sort and count trigram words        
  --all   run all options

Your task is to perform stemming on doc.txt using the analyze program.

After running stemming, you should see a new file named doc.stems.txt in output directory Assignment_1/build/Assignment1/. Examine the file and see how words are mapped back to their stems.

Now do the word count again and

1	wc < doc.stems.txt \| sed 's/^\s*//g' > doc.stems.txt.wc

Submit your output file doc.stems.txt.wc. In the My submission tab, click + Create submission. Upload your file to Task 2: Stemming.

solution

open mingw64.exe
run cd Assignment_1/build and ./analyze to check the analyze.exe function.
run ./analyze ../config.toml Assignment1/doc.txt --stem to generate stemmed text file
1
2
Running stemming algorithm
-> file saved as Assignment1/doc.stems.txt
cd Assignment1
run wc < doc.stems.txt | sed 's/^\s*//g' > doc.stems.txt.wc
submit

Task 3: Part-of-Speech Tagging (10 pts)

As you have learned in the lecture on natural language processing, part-of-speech tagging is the process of assigning parts of speech (such as verb, noun, adjective, adverb, etc.) to the words in a given text. Your task is to do POS tagging (without replacement) on doc.txt using the analyze program. After executing the program, examine the output in doc.pos-tagged.txt. You should see how a POS tag is assigned to each word after the underscore. Note that the tags are abbreviated; for example, the tag “NN” stands for a noun. For a list of commonly used tags and their meanings see the UPenn Treebank list. For a more comprehensive covering of POS tagging, you can check MeTA’s POS Tagging Tutorial. Don’t forget to submit your output file after running the following!
wc < doc.pos-tagged.txt | sed 's/^\s*//g' > doc.pos-tagged.txt.wc

solution

open mingw64.exe
run cd Assignment_1/build

run ./analyze ../config.toml Assignment1/doc.txt --pos to generate POS tagging (without replacement) text file

1
2
3

Running POS-tagging with replace = false
Loading tagging model
 -> file saved as Assignment1/doc.pos-tagged.txt

cd Assignment1
run wc < doc.pos-tagged.txt | sed 's/^\s*//g' > doc.pos-tagged.txt.wc
submit

Task 3.5: Writing a New Function (15 pts)

Now that you have explored some of the basic text analysis techniques in analyze, you will add a new function to analyze that does stopword removal followed by stemming on the same text document.

MeTA implements the different text analysis techniques such as stemming and stopword removal as filters that can be chained together. See MeTA’s filters Reference for a list of the different supported filters.

To implement your new function, edit Assignment_1/analyze.cpp. Examine and try to understand how the functions stem and stop are performing tokenization and filtering.

Find the function:

1 2	void stopstem(const std::string& file, const cpptoml::table& config)

Your task is to complete the function in order to perform the required task. In the place where we wrote:

1 2	// Insert the line required to do stopword removal here // Insert the line required to do stemming here (using Porter2 Stemmer)

You should only add the stopword removal filter and the Porter2 stemming filter. (2 Lines in total) Do not add any other filters.

Since source code is changed, you should recompile after writing your code:

1
2
3

cd Assignment_1/build
cmake .. -DCMAKE_BUILD_TYPE=Release; make -j8
./analyze ../config.toml Assignment1/doc.txt --stopstem

and we still ask you to run

1	wc < doc.stopstem.txt \| sed 's/^\s*//g' > doc.stopstem.txt.wc

and submit your result.

solution

add codes in analyze.cpp

// Insert the line required to do stopword removal here
stream = make_unique<filters::list_filter>(std::move(stream), *stopwords);
// Insert the line required to do stemming here (using Porter2 Stemmer)
stream = make_unique<filters::porter2_filter>(std::move(stream));

run

1
2
3

cd Assignment_1/build
cmake .. -G "MSYS Makefiles" -DCMAKE_BUILD_TYPE=Release
make -j8

it appears

[ 97%] Built target relevance-judgements
C:/msys64/home/SSQ/Assignment_1/ranking-experiment.cpp:45:9: error: 'float pl2_ranker::score_one(const meta::index::score_data&)' marked 'override', but does not override
   float score_one(const index::score_data&) override;
         ^~~~~~~~~
C:/msys64/home/SSQ/Assignment_1/ranking-experiment.cpp: In member function 'float pl2_ranker::score_one(const meta::index::score_data&)':
C:/msys64/home/SSQ/Assignment_1/ranking-experiment.cpp:69:9: warning: unused variable 'doc_len' [-Wunused-variable]
   float doc_len = sd.idx.doc_size(sd.d_id);  // Current document length
         ^~~~~~~
C:/msys64/home/SSQ/Assignment_1/ranking-experiment.cpp:70:9: warning: unused variable 'avg_dl' [-Wunused-variable]
   float avg_dl = sd.avg_dl;      // Average document length in the corpus
         ^~~~~~
C:/msys64/home/SSQ/Assignment_1/ranking-experiment.cpp:71:9: warning: unused variable 'tf' [-Wunused-variable]
   float tf = sd.doc_term_count;  // Raw term count in the document
         ^~
C:/msys64/home/SSQ/Assignment_1/ranking-experiment.cpp:72:9: warning: unused variable 'pi' [-Wunused-variable]
   float pi = 3.14;               // Use this for pi - Do NOT use other values
         ^~
C:/msys64/home/SSQ/Assignment_1/ranking-experiment.cpp:73:9: warning: unused variable 'lambda' [-Wunused-variable]
   float lambda = lambda_;        // pl2's parameter
         ^~~~~~
C:/msys64/home/SSQ/Assignment_1/ranking-experiment.cpp:74:9: warning: unused variable 'c' [-Wunused-variable]
   float c = c_;                  // pl2's parameter
         ^
C:/msys64/home/SSQ/Assignment_1/ranking-experiment.cpp: In function 'void pl2_tune(const std::shared_ptr<meta::index::cached_index<meta::index::inverted_index, meta::caching::default_dblru_cache> >&, std::vector<meta::corpus::document>&, meta::index::ir_eval&, float&, float&)':
C:/msys64/home/SSQ/Assignment_1/ranking-experiment.cpp:127:48: error: no matching function for call to 'meta::index::ir_eval::avg_p(std::vector<meta::index::search_result>&, meta::doc_id, int)'
         eval.avg_p(ranking, (*query).id(), 1000);  // eval.avg_p stores the
                                                ^
In file included from C:/msys64/home/SSQ/Assignment_1/ranking-experiment.cpp:14:0:
C:/msys64/home/SSQ/meta/include/meta/index/eval/ir_eval.h:91:12: note: candidate: double meta::index::ir_eval::avg_p(const result_type&, meta::query_id, uint64_t)
     double avg_p(const result_type& results, query_id q_id,
            ^~~~~
C:/msys64/home/SSQ/meta/include/meta/index/eval/ir_eval.h:91:12: note:   no known conversion for argument 2 from 'meta::doc_id {aka meta::util::numerical_identifier<meta::doc_id_tag, long long unsigned int>}' to 'meta::query_id {aka meta::util::numerical_identifier<meta::query_id_tag, long long unsigned int>}'
C:/msys64/home/SSQ/Assignment_1/ranking-experiment.cpp: In function 'int main(int, char**)':
C:/msys64/home/SSQ/Assignment_1/ranking-experiment.cpp:241:43: error: no matching function for call to 'meta::index::ir_eval::avg_p(std::vector<meta::index::search_result>&, meta::doc_id, int)'
       eval.avg_p(ranking, query.id(), 1000);
                                           ^
In file included from C:/msys64/home/SSQ/Assignment_1/ranking-experiment.cpp:14:0:
C:/msys64/home/SSQ/meta/include/meta/index/eval/ir_eval.h:91:12: note: candidate: double meta::index::ir_eval::avg_p(const result_type&, meta::query_id, uint64_t)
     double avg_p(const result_type& results, query_id q_id,
            ^~~~~
C:/msys64/home/SSQ/meta/include/meta/index/eval/ir_eval.h:91:12: note:   no known conversion for argument 2 from 'meta::doc_id {aka meta::util::numerical_identifier<meta::doc_id_tag, long long unsigned int>}' to 'meta::query_id {aka meta::util::numerical_identifier<meta::query_id_tag, long long unsigned int>}'
C:/msys64/home/SSQ/Assignment_1/ranking-experiment.cpp: In lambda function:
C:/msys64/home/SSQ/Assignment_1/ranking-experiment.cpp:264:58: error: no matching function for call to 'meta::index::ir_eval::precision(std::vector<meta::index::search_result>&, meta::doc_id, int)'
                 << eval.precision(ranking, query.id(), 10) << std::endl;
                                                          ^
In file included from C:/msys64/home/SSQ/Assignment_1/ranking-experiment.cpp:14:0:
C:/msys64/home/SSQ/meta/include/meta/index/eval/ir_eval.h:48:12: note: candidate: double meta::index::ir_eval::precision(const result_type&, meta::query_id, uint64_t) const
     double precision(const result_type& results, query_id q_id,
            ^~~~~~~~~
C:/msys64/home/SSQ/meta/include/meta/index/eval/ir_eval.h:48:12: note:   no known conversion for argument 2 from 'meta::doc_id {aka meta::util::numerical_identifier<meta::doc_id_tag, long long unsigned int>}' to 'meta::query_id {aka meta::util::numerical_identifier<meta::query_id_tag, long long unsigned int>}'
C:/msys64/home/SSQ/Assignment_1/ranking-experiment.cpp:268:43: error: no matching function for call to 'meta::index::ir_eval::avg_p(std::vector<meta::index::search_result>&, meta::doc_id, int)'
       eval.avg_p(ranking, query.id(), 1000);
                                           ^
In file included from C:/msys64/home/SSQ/Assignment_1/ranking-experiment.cpp:14:0:
C:/msys64/home/SSQ/meta/include/meta/index/eval/ir_eval.h:91:12: note: candidate: double meta::index::ir_eval::avg_p(const result_type&, meta::query_id, uint64_t)
     double avg_p(const result_type& results, query_id q_id,
            ^~~~~
C:/msys64/home/SSQ/meta/include/meta/index/eval/ir_eval.h:91:12: note:   no known conversion for argument 2 from 'meta::doc_id {aka meta::util::numerical_identifier<meta::doc_id_tag, long long unsigned int>}' to 'meta::query_id {aka meta::util::numerical_identifier<meta::query_id_tag, long long unsigned int>}'
In file included from C:/msys64/mingw64/include/c++/6.3.0/bits/locale_conv.h:41:0,
                 from C:/msys64/mingw64/include/c++/6.3.0/locale:43,
                 from C:/msys64/mingw64/include/c++/6.3.0/iomanip:43,
                 from C:/msys64/home/SSQ/meta/deps/cpptoml/include/cpptoml.h:15,
                 from C:/msys64/home/SSQ/meta/include/meta/corpus/metadata.h:17,
                 from C:/msys64/home/SSQ/meta/include/meta/corpus/document.h:18,
                 from C:/msys64/home/SSQ/Assignment_1/ranking-experiment.cpp:13:
C:/msys64/mingw64/include/c++/6.3.0/bits/unique_ptr.h: In instantiation of 'typename std::_MakeUniq<_Tp>::__single_object std::make_unique(_Args&& ...) [with _Tp = pl2_ranker; _Args = {}; typename std::_MakeUniq<_Tp>::__single_object = std::unique_ptr<pl2_ranker, std::default_delete<pl2_ranker> >]':
C:/msys64/home/SSQ/Assignment_1/ranking-experiment.cpp:110:41:   required from here
C:/msys64/mingw64/include/c++/6.3.0/bits/unique_ptr.h:791:30: error: invalid new-expression of abstract class type 'pl2_ranker'
     { return unique_ptr<_Tp>(new _Tp(std::forward<_Args>(__args)...)); }
                              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
C:/msys64/home/SSQ/Assignment_1/ranking-experiment.cpp:26:7: note:   because the following virtual functions are pure within 'pl2_ranker':
 class pl2_ranker : public index::ranker {
       ^~~~~~~~~~
In file included from C:/msys64/home/SSQ/meta/include/meta/index/eval/ir_eval.h:19:0,
                 from C:/msys64/home/SSQ/Assignment_1/ranking-experiment.cpp:14:
C:/msys64/home/SSQ/meta/include/meta/index/ranker/ranker.h:220:40: note:       virtual std::vector<meta::index::search_result> meta::index::ranker::rank(meta::index::ranker_context&, uint64_t, const filter_function_type&)
     virtual std::vector<search_result> rank(ranker_context& ctx,
                                        ^~~~
In file included from C:/msys64/mingw64/include/c++/6.3.0/bits/locale_conv.h:41:0,
                 from C:/msys64/mingw64/include/c++/6.3.0/locale:43,
                 from C:/msys64/mingw64/include/c++/6.3.0/iomanip:43,
                 from C:/msys64/home/SSQ/meta/deps/cpptoml/include/cpptoml.h:15,
                 from C:/msys64/home/SSQ/meta/include/meta/corpus/metadata.h:17,
                 from C:/msys64/home/SSQ/meta/include/meta/corpus/document.h:18,
                 from C:/msys64/home/SSQ/Assignment_1/ranking-experiment.cpp:13:
C:/msys64/mingw64/include/c++/6.3.0/bits/unique_ptr.h: In instantiation of 'typename std::_MakeUniq<_Tp>::__single_object std::make_unique(_Args&& ...) [with _Tp = pl2_ranker; _Args = {const double&, const double&}; typename std::_MakeUniq<_Tp>::__single_object = std::unique_ptr<pl2_ranker, std::default_delete<pl2_ranker> >]':
C:/msys64/home/SSQ/Assignment_1/ranking-experiment.cpp:164:53:   required from here
C:/msys64/mingw64/include/c++/6.3.0/bits/unique_ptr.h:791:30: error: invalid new-expression of abstract class type 'pl2_ranker'
     { return unique_ptr<_Tp>(new _Tp(std::forward<_Args>(__args)...)); }
                              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
C:/msys64/mingw64/include/c++/6.3.0/bits/unique_ptr.h: In instantiation of 'typename std::_MakeUniq<_Tp>::__single_object std::make_unique(_Args&& ...) [with _Tp = pl2_ranker; _Args = {std::basic_istream<char, std::char_traits<char> >&}; typename std::_MakeUniq<_Tp>::__single_object = std::unique_ptr<pl2_ranker, std::default_delete<pl2_ranker> >]':
C:/msys64/home/SSQ/meta/include/meta/index/ranker/ranker_factory.h:174:31:   required from 'std::unique_ptr<meta::index::ranker, std::default_delete<meta::index::ranker> > meta::index::load_ranker(std::istream&) [with Ranker = pl2_ranker; std::istream = std::basic_istream<char>]'
C:/msys64/home/SSQ/meta/include/meta/index/ranker/ranker_factory.h:188:5:   required from 'void meta::index::register_ranker() [with Ranker = pl2_ranker]'
C:/msys64/home/SSQ/Assignment_1/ranking-experiment.cpp:170:38:   required from here
C:/msys64/mingw64/include/c++/6.3.0/bits/unique_ptr.h:791:30: error: invalid new-expression of abstract class type 'pl2_ranker'
make[2]: *** [CMakeFiles/ranking-experiment.dir/build.make:63：CMakeFiles/ranking-experiment.dir/ranking-experiment.cpp.obj] 错误 1
make[1]: *** [CMakeFiles/Makefile2:138：CMakeFiles/ranking-experiment.dir/all] 错误 2
make[1]: *** 正在等待未完成的任务....
[ 97%] Linking CXX executable analyze.exe
[ 97%] Built target analyze
make: *** [Makefile:130：all] 错误 2

run ./analyze ../config.toml Assignment1/doc.txt --stopstem
it appears

1 2	Running stopword removal and stemming -> file saved as Assignment1/doc.stopstem.txt

run cd Assignment1
run wc < doc.stopstem.txt | sed 's/^\s*//g' > doc.stopstem.txt.wc
submit

Build the Search Engine

In this section, you will perform and experiment with indexing and retrieval on a special dataset called the MOOCs dataset.

Exploring the Dataset

The MOOCs dataset contains the descriptions found on the webpages of around 23,000 MOOCs (Massive Open Online Courses). You will start by exploring the dataset. Navigate to Assignment_1/moocs. This directory contains the files that describe the dataset.

moocs.dat contains the content of the webpages of all the MOOCs; each MOOC’s main page occupies exactly one line in the file. Feel free to open the file to get an idea about the contents, but be wary that it is a large file and might take some time to load.
metadata.dat contains the names and the URLs of the MOOCs. The entry on line x in metadata.dat corresponds to the MOOC on line x in moocs.dat.
moocs-queries.txt contains a set of queries that you will use to evaluate the effectiveness of your search engine.
moocs-qrels.txt contains the relevance judgments corresponding to the queries in moocs-queries.txt. Each line in moocs-qrels.txt has the following format: (querynum documentID 1). This means that the document represented by documentID is a relevant document for the query whose number is querynum. The relevance judgments in moocs-qrels.txt were created by human assessors who ran the queries and chose the relevant documents. Later on in the assignment, you are going to use these judgments to quantify the performance of your search engine.

Indexing

Now that you have an idea about tokenization, text filtering, and the dataset to be used, you will proceed to index the MOOCs dataset. In this process, an inverted index will be created. As you have seen in the lectures, the inverted index is a data structure that supports efficient retrieval of documents by allowing the lookup of all the documents that contain a specific term.

Before you proceed to index the dataset, we encourage you to open config.toml and examine the settings that we have already configured. For instance, the snippet shown below tells the indexer where to find the MOOCs dataset, specifies that it is a line corpus (i.e., each document is on one line), and defines the name of the inverted index to be created. The forward index will not be used in this assignment.

query-judgements = "../../meta/data/moocs/moocs-qrels.txt" 
querypath = "../../meta/data/moocs/" 
corpus = "line.toml" 
dataset = "moocs" 
forward-index = "moocs-fwd" 
inverted-index = "moocs-inv"

You should also have a look at another important snippet:

[[analyzers]]
method = "ngram-word"
ngram = 1
filter = "default-unigram-chain"

The settings under the analyzers tag control how tokenization and filtering are performed, prior to creating the inverted index. The tokenizer will segment each document into unigrams. MeTA uses its default filtering chain, which performs a couple of predefined filters including lower case conversion, length filtering (which discards tokens whose length is not within a certain range), and stemming. To read more about modifying META’s default tokenization and filtering behavior see MeTA’s analyzers and filters page.

To index the dataset, in the Assignment_1/build directory run:

1	../../meta/build/index ../config.toml

if it occurs error index name missing from configuration file
add index = "moocs-inv" in your config.toml file after inverted-index = "moocs-inv"

 > Counting lines in file: [=================================] 100% ETA 00:00:00
1498050577: [info]     Creating index: moocs-inv/inv (C:/msys64/home/SSQ/meta/src/index/inverted_index.cpp:119)
 > Tokenizing Docs: [========================================] 100% ETA 00:00:00
 > Merging: [================================================] 100% ETA 00:00:00
1498050585: [info]     Created uncompressed postings file moocs-inv/inv/postings.index (9.120000 MB) (C:/msys64/home/SSQ/meta/src/index/inverted_index.cpp:148)
 > Compressing postings: [===================================] 100% ETA 00:00:00
1498050586: [info]     Created compressed postings file (7.620000 MB) (C:/msys64/home/SSQ/meta/src/index/inverted_index.cpp:279)
1498050586: [info]     Done creating index: moocs-inv/inv (C:/msys64/home/SSQ/meta/src/index/inverted_index.cpp:166)
Number of documents: 17972
Avg Doc Length: 358.238
Unique Terms: 160615
Index generation took: 11.217 seconds

(index program is in the meta/build directory under root). You should see something like below:

$ ./meta/build/index ../config.toml
1463154868: [info]     Loading index from disk: moocs-inv (root/meta/src
    /index/inverted_index.cpp:178)
Number of documents: 23566
Avg Doc Length: 383.014
Unique Terms: 193614
Index generation took: 0.007 seconds

This will start by performing tokenization and applying the text filters defined in config.toml; it then creates the inverted index and places it in/meta/build/moocs-inv. When the program finishes execution, you should get a summary of the indexed corpus as above

Searching

After creating the inverted index, you can efficiently search the MOOCs dataset. The ranking function to be used in retrieving documents is defined in config.toml. Open config.toml and look for the ranker tag. Under this tag, you will see that the default ranker is “bm25” along with values assigned to its three parameters. To test your search engine, you can use the interactive-search program provided with MeTA:

1 2	cd Assignment_1/build ../../meta/build/interactive-search ../config.toml

Enter any query you want, and the top results should show up instantaneously. Feel free to experiment with different queries to explore the contents of the dataset.

1498051690: [info]     Loading index from disk: moocs-inv/inv (C:/msys64/home/SSQ/meta/src/index/inverted_index.cpp:171)
Enter a query, or blank to quit.
> machine learning
Showing top 5 results (3ms)
1. [none] (score = 8.174631, docid = 2146)
2. [none] (score = 7.979603, docid = 12388)
3. [none] (score = 7.951803, docid = 4087)
4. [none] (score = 7.932771, docid = 7194)
5. [none] (score = 7.900622, docid = 10)
> text mining
Showing top 5 results (1ms)
1. [none] (score = 11.931686, docid = 11892)
2. [none] (score = 11.278910, docid = 11890)
3. [none] (score = 10.591638, docid = 10424)
4. [none] (score = 10.041771, docid = 1646)
5. [none] (score = 9.718287, docid = 5603)

When you finish, enter a blank query to exit.

In what follows you will evaluate the performance of the search engine using different ranking functions and parameters. The main evaluation measure to be used is MAP (mean average precision), which is the arithmetic mean of the average precision of all the queries being used for evaluation. We have provided you with a program called ranking-experiment which evaluates the MAP of any retrieval function you specify in config.toml.

Task 4: BM25 (5 pts)

Evaluate the performance of bm25 with default parameters (i.e., do not change config.toml) by running:

1 2	cd Assignment_1/build ./ranking-experiment ../config.toml task4

You should see a list of the top results corresponding to the different test queries, the precision at 10 for each query, and the final MAP value.

At the same time, we create a output file for you to submit: /Assignment_1/build/Assignment1/task4.txt

Submit task4.txt.

Task 5: BM25 with tuned parameters (10 pts)

Change the parameter b of bm25 in config.toml to 0.75

ranker] 
method = "bm25" 
k1 = 1.2 
b = 0.75 
k3 = 500

and run again:

1 2	cd Assignment_1/build ./ranking-experiment ../config.toml task5

and submit your /Assignment_1/build/Assignment1/task5.txt

Week 4

In this week’s lessons, you will learn probabilistic retrieval models and statistical language models, particularly the detail of the query likelihood retrieval function with two specific smoothing methods, and how the query likelihood retrieval function is connected with the retrieval heuristics used in the vector space model.

Week 4 Information

Week 4 Overview 10 min

Time

This module should take approximately 3 hours of dedicated time to complete, with its videos and assignments.

Activities

The activities for this module are listed below (with assignments in bold):

Activity	Estimated Time Required
Week 4 Video Lectures	2 hours
Week 4 Graded Quiz	1 hour

Goals and Objectives

After you actively engage in the learning experiences in this module, you should be able to:

Explain how to interpret p(R=1|q,d) and estimate it based on a large set of collected relevance judgments (or clickthrough information) about query q and document d.
https://www.coursera.org/learn/text-retrieval/lecture/nkg5n/lesson-4-1-probabilistic-retrieval-model-basic-idea
Explain how to interpret the conditional probability p(q|d) used for scoring documents in the query likelihood retrieval function.
Explain what a statistical language model and a unigram language model are.
Explain how to compute the maximum likelihood estimate of a unigram language model.
Explain how to use unigram language models to discover semantically related words.
Compute p(q|d) based on a given document language model p(w|d).
Explain what smoothing does.
Show that query likelihood retrieval function implements TF-IDF weighting if we smooth the document language model p(w|d) using the collection language model p(w|C) as a reference language model.
Compute the estimate of p(w|d) using Jelinek-Mercer (JM) smoothing and Dirichlet Prior smoothing, respectively.

Guiding Questions

Develop your answers to the following guiding questions while completing the readings and working on assignments throughout the week.

Given a table of relevance judgments in the form of three columns (query, document, and binary relevance judgments), how can we estimate p(R=1|q,d)?

https://www.coursera.org/learn/text-retrieval/lecture/nkg5n/lesson-4-1-probabilistic-retrieval-model-basic-idea =$\frac{𝑐𝑜𝑢𝑛𝑡(𝑞, 𝑑, 𝑅 = 1)}{
𝑐𝑜𝑢𝑛𝑡(𝑞, 𝑑)}$

How should we interpret the query likelihood conditional probability p(q|d)?

$q=w_1 w_2 … w_n. p(q,d) = p(w_1|d)\cdotp(w_2|d)…p(w_n|d)$

What is a statistical language model? What is a unigram language model? How many parameters are there in a unigram language model?

A probability distribution over word sequences.
word distribution.
$\frac{c(w,d)}{|d|}$

How do we compute the maximum likelihood estimate of the unigram language model (based on a text sample)?

$\frac{c(w,d)}{|d|}$

What is a background language model? What is a collection language model? What is a document language model?

p(w|B)
p(w|C)
p(w|d)

Why do we need to smooth a document language model in the query likelihood retrieval model? What would happen if we don’t do smoothing?

let the unseen word have probability.
probability of unseen word would be 0.

When we smooth a document language model using a collection language model as a reference language model, what is the probability assigned to an unseen word in a document?

$\alpha_d p(p|d)$

How can we prove that the query likelihood retrieval function implements TF-IDF weighting if we use a collection language model smoothing?

$p_{seen}(W_i|d)$ refers to TF weighting, $\alpha_d p(p|d)$ refers to IDF weighting.

How does linear interpolation (Jelinek-Mercer) smoothing work? What is the formula?

$p(w,d) = (1-\lambda)\frac{c(w,d)}{|d|}+\lambda p(w,d)$, $\lambda \in [0,1]$

How does Dirichlet prior smoothing work? What is the formula?

$p(w,d) = \frac{c(w,d)+\mu p(w|C)}{|d|+\mu} = \frac{|d|}{|d|+\mu}\frac{c(w,d)}{|d|}+\frac{\mu}{|d|+\mu}p(w,d)$, $\mu\in[0,\infty)$

What are the similarities and differences between Jelinek-Mercer smoothing and Dirichlet prior smoothing?

• Two smoothing methods
– Jelinek-Mercer: Fixed coefficient linear interpolation
– Dirichlet Prior: Adding pseudo counts; adaptive interpolation
• Both lead to state of the art retrieval functions with
assumptions clearly articulated (less heuristic)
– Also implementing TF-IDF weighting and doc length
normalization
– Has precisely one (smoothing) parameter

Additional Readings and Resources

C. Zhai and S. Massung. Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining, ACM Book Series, Morgan & Claypool Publishers, 2016. Chapter 6 - Section 6.4

Key Phrases and Concepts

Keep your eyes open for the following key terms or phrases as you complete the readings and interact with the lectures. These topics will help you better understand the content in this module.

p(R=1|q,d) ; query likelihood, p(q|d)
Statistical and unigram language models
Maximum likelihood estimate
Background, collection, and document language models
Smoothing of unigram language models
Relation between query likelihood and TF-IDF weighting
Linear interpolation (i.e., Jelinek-Mercer) smoothing
Dirichlet Prior smoothing

Week 4 Lessons

Lesson 4.1: Probabilistic Retrieval Model - Basic Idea 12 min

https://www.coursera.org/learn/text-retrieval/lecture/nkg5n/lesson-4-1-probabilistic-retrieval-model-basic-idea

Lesson 4.2: Statistical Language Model 17 min

https://www.coursera.org/learn/text-retrieval/lecture/kv4Aj/lesson-4-2-statistical-language-model
• What is a Language Model?
• Unigram Language Model
• Uses of a Language Model
Language Model = probability distribution over text
• Unigram Language Model = word distribution
• Uses of a Language Model
– Representing topics
– Discovering word associations

Lesson 4.3: Query Likelihood Retrieval Function 12 min

https://www.coursera.org/learn/text-retrieval/lecture/BWexZ/lesson-4-3-query-likelihood-retrieval-function

Lesson 4.4: Statistical Language Model - Part 1 12 min

https://www.coursera.org/learn/text-retrieval/lecture/f4CYl/lesson-4-4-statistical-language-model-part-1

Lesson 4.5: Statistical Language Model - Part 2 9 min

https://www.coursera.org/learn/text-retrieval/lecture/hI1vE/lesson-4-5-statistical-language-model-part-2

Lesson 4.6: Smoothing Methods - Part 1 9 min

https://www.coursera.org/learn/text-retrieval/lecture/kM6Ie/lesson-4-6-smoothing-methods-part-1

Lesson 4.7: Smoothing Methods - Part 2 13 min

https://www.coursera.org/learn/text-retrieval/lecture/gxNMo/lesson-4-7-smoothing-methods-part-2

Week 4 Activities

Practice Quiz: Week 4 Practice Quiz 10 questions

PRACTICE QUIZ
Week 4 Practice Quiz
10 questions
To Pass70% or higher
Deadline
July 2, 11:59 PM PDT

1 point
1.You are given a vocabulary composed of only three words: “text,” “mining,” and “research.” Below are the probabilities of two of these three words given by a unigram language model:

Word	Probability
text	0.4
mining	0.2

What is the probability of generating the phrase “text mining research” using this unigram language model?

0.032
The probability of “research” is P(“research”) = 1 – ( P(“text”)+P(“mining”) ) = 1 – (0.4 + 0.2) = 0.4. The probability of generating the given phrase P(“text mining research”) = P(“text”) x P(“mining”) x P(“research”) = 0.4 x 0.2 x 0.4 = 0.032.

0.4

0.08

1 point
2.You are given the query Q= “food safety” and two documents:

D1 = “food quality regulations”

D2 = “food safety measures”

Assume you are using the maximum likelihood estimator without smoothing to calculate the probabilities of words in documents (i.e., the estimated p(w|D) is the relative frequency of word w in the document D). Based on the unigram query likelihood model, which of the following choices is correct?

P(Q|D1) = 1/3 P(Q|D2) = 0

P(Q|D1) = 1/2 P(Q|D2) = 1/2

P(Q|D1) = 0 P(Q|D2) = 1/9

P(Q|D1) = 1/3 P(Q|D2) = 1/9

1 point
3.Probability smoothing avoids assigning zero probabilities to unseen words in documents.

True

False

1 point
4.Assume you are given two scoring functions:

S1(Q,D)=P(Q|D)
S2(Q,D)=logP(Q|D)
For the same query and corpus, S1 and S2 will give the same ranked list of documents.

True

False

1 point
5.Assume you are using linear interpolation (Jelinek-Mercer) smoothing to estimate the probabilities of words in a certain document. What happens to the smoothed probability of the word when the parameter λ is decreased?

It becomes closer to the maximum likelihood estimate of the probability derived from the document.

It becomes closer to the probability of the word in the collection language model.

It does not change.

1 point
6.Refer to the Rocchio feedback formula in the lectures. If you want to reduce the effect of the relevant documents in the updated query, which of the following should be done?

Increase γ

Reduce β
The weight assigned to the centroid of the relevant documents is directly proportional to β.

Increase β

Reduce γ

1 point
7.Let q be the original query vector, DR={P1,…,Pn} be the set of positive document vectors, and DN={N1,…,Nm} be the set of negative document vectors. Let q1 be the expanded query vector after applying Rocchio on DR and DN with positive parameter values α, β, and γ. Let q2 be the expanded query vector after applying Rocchio on DR and DN with the same values for α, β, but γ being set to zero.

In which updated query do you expect stopwords to have higher weights?

q1

1 point
8.In the query likelihood model, why is smoothing necessary?

Without smoothing, if a term in the document is not in the query, then the likelihood becomes −∞.

Without smoothing, if a term is neither in the query nor the document, then the likelihood becomes −∞.

Without smoothing, if a term in the query is not in the document, then the likelihood becomes −∞.

1 point
9.Which of the following is NOT correct about the unigram model?

The probability of generating the word sequence “A” “B” “C” is the same as generating “C” “B” “A.”

The probability of generating the words A AND B is the product of the probability of generating A and the probability of generating B.

The probability of generating the word A OR B is the sum of the probability of generating A and the probability of generating B.

1 point
10.Assume that β=1 is a good choice when performing relevance feedback using Rocchio’s method. What is a reasonable value of β to use when relying on pseudo feedback?

Less than 1
When doing relevance feedback, the judgments are usually reliable since human assessors generate them after reading the queries and documents. However, in pseudo feedback, the top k documents retrieved by the system are “blindly” assumed to be relevant, which makes the judgments less reliable compared to relevance feedback. The reasonable choice is to lower the parameter β, which can be thought of as the degree of “confidence” in the documents being used as “positive” examples in feedback.

More than 1

Quiz: Week 4 Quiz10 questions Due July 2, 11:59 PM PDT

QUIZ
Week 4 Quiz
10 questions
To Pass70% or higher
Attempts3 every 8 hours
Deadline
July 2, 11:59 PM PDT

1 point
1.Assume you are using a unigram language model to calculate the probabilities of phrases. Then, the probabilities of generating the phrases “study text mining” and “text mining study” are not equal, i.e., P(“study text mining”) ≠ P(“text mining study”).

True

False

1 point
2.You are given a vocabulary composed of only four words: “the,” “computer,” “science,” and “technology.” Below are the probabilities of three of these four words given by a unigram language model.

Word	Probability
the	0.4
computer	0.2
science	0.3

What is the probability of generating the phrase “the technology” using this unigram language model?

0.0024

0.04

0.1

0.5

1 point
3.You are given the query Q= “online courses” and two documents:

D1 = “online courses search engine”

D2 = “online education is affordable”

Assume you are using the maximum likelihood estimator without smoothing to calculate the probabilities of words in documents (i.e., the estimated p(w|D) is the relative frequency of the word w in the document D). Based on the unigram query likelihood model, which of the following choices is correct?

P(Q|D1) = 0 P(Q|D2) = 1/4

P(Q|D1) = 1/16 P(Q|D2) = 1/4

P(Q|D1) = 1/2 P(Q|D2) = 1/2

P(Q|D1) = 1/16 P(Q|D2) = 0

1 point
4.Assume the same scenario as in Question 3, but using linear interpolation (Jelinek-Mercer) smoothing with λ=0.5. Furthermore, you are given the following probabilities of some of the words in the collection language model:

Word	P(w	C)
online	1/4
courses	1/4
education	1/8

Based on the unigram query likelihood model, which of the following choices is correct?

$P(Q|D1) = \sum_{i}^N p(w_i|D1) = p(“online”|D1)\cdot p(“courses”|D1) = {0.5\frac{1}{4} + 0.5\frac{1}{4}}\cdot{0.5\frac{1}{4} + 0.5\frac{1}{4}} = 1/16$

$P(Q|D2) = \sum_{i}^N p(w_i|D2) = p(“online”|D2)\cdot p(“courses”|D2) = {0.5\frac{1}{4} + 0.5\frac{1}{4}}\cdot{0.5\frac{0}{4} + 0.5\frac{1}{4}} = 1/32$

P(Q|D1) = 1/32 P(Q|D2) = 1/32

P(Q|D1) = 1/16 P(Q|D2) = 0

P(Q|D1) = 1/16 P(Q|D2) = 1/32

P(Q|D1) = 1/16 P(Q|D2) = 1/16

1 point
5.If word count for every term doubles in one document:

If not using any smoothing, query likelihood would change for some queries.

p(w|d) remains the same if using Jelinek-Mercer smoothing.

p(w|d) remains the same if using Dirichlet-prior smoothing.

1 point
6.Assume you are using Dirichlet Prior smoothing to estimate the probabilities of words in a certain document. What happens to the smoothed probability of the word when the parameter μ is increased?

It becomes closer to the probability of the word in the collection language model.

It does not change.

It becomes closer to the maximum likelihood estimate of the probability derived from the document.

It tends to 1.

1 point
7.It is possible that pseudo feedback decreases the precision and recall of a certain retrieval system.

False

True

1 point
8.Refer to the Rocchio feedback formula in the lectures. If you want to eliminate the effect of non-relevant documents when doing feedback, which of the following parameters must be set to zero?

γ and β

β

1 point
9.Let q be the original query vector, DR={P1,…,Pn} be the set of positive document vectors, and DN={N1,…,Nm} be the set of negative document vectors. Let q1 be the expanded query vector after applying Rocchio on DR and DN with positive parameter values α, β, and γ. Let q2 be the expanded query vector after applying Rocchio on DR and DN with the same values for α, β, but γ being set to zero. Which of the following is correct?

q2 can have greater or equal weights to q1 for each dimension.

q1 can have greater or equal weights to q2 for each dimension.

q1 has strictly greater weights than q2 for each dimension.

q2 has strictly greater weights than q1 for each dimension.

1 point
10.Which of the following is not true about the KL-divergence retrieval model?

It represents both queries and documents as language models.

It supports relevance feedback.

It cannot be computed as efficiently as the query likelihood model.

Week 5

In this week’s lessons, you will learn feedback techniques in information retrieval, including the Rocchio feedback method for the vector space model, and a mixture model for feedback with language models. You will also learn how web search engines work, including web crawling, web indexing, and how links between web pages can be leveraged to score web pages.

Week 5 Information

Week 5 Overview 10 min

Time

This module should take approximately 3 hours of dedicated time to complete, with its videos and assignments.

Activities

The activities for this module are listed below (with assignments in bold):

Activity	Estimated Time Required
Week 5 Video Lectures	2 hours
Week 5 Graded Quiz	1 hour

Goals and Objectives

After you actively engage in the learning experiences in this module, you should be able to:

Explain the similarity and differences in the three different kinds of feedback, i.e., relevance feedback, pseudo-relevance feedback, and implicit feedback.
Explain how the Rocchio feedback algorithm works.
Explain how the Kullback-Leibler (KL) divergence retrieval function generalizes the query likelihood retrieval function.
Explain the basic idea of using a mixture model for feedback.
Explain some of the main general challenges in creating a web search engine.
Explain what a web crawler is and what factors have to be considered when designing a web crawler.
Explain the basic idea of Google File System (GFS).
Explain the basic idea of MapReduce and how we can use it to build an inverted index in parallel.
Explain how links on the web can be leveraged to improve search results.
Explain how PageRank and HITS algorithms work.

Guiding Questions

Develop your answers to the following guiding questions while completing the readings and working on assignments throughout the week.

What is relevance feedback? What is pseudo-relevance feedback? What is implicit feedback?

Users make explicit relevance judgments on the initial results
(judgments are reliable, but users don’t want to make extra effort)
Top-k initial results are simply assumed to be relevant
(judgments aren’t reliable, but no user activity is required)
User-clicked docs are assumed to be relevant; skipped ones non-relevant
(judgments aren’t completely reliable, but no extra effort from users)

How does Rocchio work? Why do we need to ensure that the original query terms have sufficiently large weights in feedback?

Drag the query to the centroid of relevant of docments.$$\vec{q_m} = \alpha \vec{q} + \frac{\beta}{|D_r|}\sum_{\forall\vec{d_j} \in D_r}^{}{\vec{d_j}} - \frac{\gamma}{|D_n|}\sum_{\forall \vec{d_j} \in D_n}^{}{\vec{d_j}}$$
Avoid “over-fitting”.

What is the KL-divergence retrieval function? How is it related to the query likelihood retrieval function?

KL-divergence(cross entropy): $$f(q,d) = \sum_{w\in d,p(w|\theta_Q)>0}{}{[p(w|\hat{\theta_Q})log\frac{P_{seen}(w|d)}{\alpha_d p(w|C)}] + log\alpha_d}$$
Query Likelihood: $$f(q,d) = \sum_{w_i \in d, w_i \in q}{}{c(w|q)[log\frac{P_{seen}(w|d)}{\alpha_d p(w|C)}] + n log\alpha_d}$$

What is the basic idea of the two-component mixture model for feedback?

High probability of topic words when feedback is common and background words are rare.

What are some of the general challenges in building a web search engine?

Scalability
How to handle the size of the Web and ensure completeness of coverage?
How to serve many user queries quickly?
Low quality information and spams
Dynamics of the Web
New pages are constantly created and some pages may be updated very quickly

What is a crawler? How can we implement a simple crawler?

crawler
– Start with a set of “seed pages” in a priority queue
– Fetch pages from the web
– Parse fetched pages for hyperlinks; add them to the queue
– Follow the hyperlinks in the queue
how
– Robustness (server failure, trap, etc.)
– Crawling courtesy (server load balance, robot exclusion, etc.)
– Handling file types (images, PDF files, etc.)
– URL extensions (cgi script, internal references, etc.)
– Recognize redundant pages (identical and duplicates)
– Discover “hidden” URLs (e.g., truncating a long URL )

What is focused crawling? What is incremental crawling?

Web crawlers that attempt to download pages that are similar to each other are called focused crawler or topical crawlers.
The process of prioritizing and revisiting URLs is usually referred to as incremental crawling.

What kind of pages should have a higher priority for recrawling in incremental crawling?
What can we do if the inverted index doesn’t fit in any single machine?
What’s the basic idea of the Google File System (GFS)?

GFS is made up of several storage systems built from low-cost commodity hardware components.
一种软件定义存储
把所有服务器上的存储设备“聚合起来”变成一个分布式的文件系统
当一台服务器上插满硬盘也不能满足存储需求的时候，软件定义储存就诞生了
对外展示为一个文件系统，底层是很多台机器的很多硬盘
有一种是对外表现为一个硬盘，（可以再分区什么的）
有一种是对外表现为一个文件系统（分布式的，可以多个主机同时操作）
还有一种是需要写程序才能用的
还有一种是同时具备以上功能
gfs应该是第二种

How does MapReduce work? What are the two key functions that a programmer needs to implement when programming with a MapReduce framework?

Software framework for parallel computation.
Map and Reduce.

How can we use MapReduce to build an inverted index in parallel?

Parallel map via lots of documents and reduce according to the “key”.

What is anchor text? Why is it useful for improving search accuracy?

Anchor Text is the visible, clickable text in a hyperlink. In modern browsers, it is often blue and underlined, such as this link to the SSQ homepage.
“Extra text”/summary for a doc

What is a hub page? What is an authority page?

A page has lots of outlinks
A page has lots of inlinks

What kind of web pages tend to receive high scores from PageRank?

A page that is cited often

How can we interpret PageRank from the perspective of a random surfer “walking” on the Web?

Computing the probability of visiting every page.

How exactly do you compute PageRank scores?
Random surfing model: At any page,
With prob. $\alpha$, randomly jumping to another page
With prob. (1-$\alpha$), randomly picking a link to follow.
p(di): PageRank score of di = average probability of visiting page di

two forms $$p(d_j) = \sum_{i = 1}^{N}{[\frac{1}{N}\alpha + (1-\alpha)M_{ij}]p(d_i)}$$ or $$\bar{p} = {(\alpha I + (1 - \alpha) M)}^T \bar{p}$$
M: transition matrix
Mij = probability of going from di to dj
$\sum_{j=1}{N}{M_{ij}=1}$

How does the HITS algorithm work?

Get the adjacency matrix
Initial values: $a(d_i) = h(d_i) = 1$
Using $$h(d_i) = \sum_{d_j \in OUT(d_i)}{}{a(d_j)}$$ $$a(d_i) = \sum_{d_j \in IN(d_i)}{}{h(d_j)}$$ i.e. $$\bar{h} = A \bar{a}, \bar{a} = A^T \bar{h}, \bar{h} = AA^T\bar{h}, \bar{a} = A^TA\bar{a}$$
Iterate and normalize: $$\sum_{i}{}{a(d_i)^2} = \sum_{i}{}{h(d_i)^2} = 1$$

Additional Readings and Resources

C. Zhai and S. Massung. Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining, ACM Book Series, Morgan & Claypool Publishers, 2016. Chapters 7 & 10

Key Phrases and Concepts

Keep your eyes open for the following key terms or phrases as you complete the readings and interact with the lectures. These topics will help you better understand the content in this module.

Relevance feedback
Pseudo-relevance feedback
Implicit feedback
Rocchio feedback
Kullback-Leiber divergence (KL-divergence) retrieval function
Mixture language model
Scalability and efficiency
Spams
Crawler, focused crawling, and incremental crawling
Google File System (GFS)
MapReduce
Link analysis and anchor text
PageRank and HITS

Week 5 Lessons

Lesson 5.1: Feedback in Text Retrieval6 min

https://www.coursera.org/learn/text-retrieval/lecture/gw3fo/lesson-5-1-feedback-in-text-retrieval

Relevance Feedback

User

Pseudo/Blind/Automatic Feedback

Top k

Implicit Feedback

Clickthroughs

Lesson 5.2: Feedback in Vector Space Model - Rocchio 12 min

https://www.coursera.org/learn/text-retrieval/lecture/PyTkW/lesson-5-2-feedback-in-vector-space-model-rocchio

Rocchio Feedback

Lesson 5.3: Feedback in Text Retrieval - Feedback in LM 19 min

https://www.coursera.org/learn/text-retrieval/lecture/M7ylk/lesson-5-3-feedback-in-text-retrieval-feedback-in-lm

Kullback-Leibler (KL) Divergence Retrieval Model
Generative Mixture Model

Lesson 5.4: Web Search: Introduction & Web Crawler 11 min

https://www.coursera.org/learn/text-retrieval/lecture/qkTHD/lesson-5-4-web-search-introduction-web-crawler

Web Search: Challenges & Opportunities
Basic Search Engine Technologies
Major Crawling Strategies
Web search is one of the most important applications of text retrieval
Crawler is an essential component of Web search applications

Lesson 5.5: Web Indexing 17 min

https://www.coursera.org/learn/text-retrieval/lecture/lRm0I/lesson-5-5-web-indexing

GFS Architecture
MapReduce: A Framework for Parallel Programming

Lesson 5.6: Link Analysis - Part 1 9 min

https://www.coursera.org/learn/text-retrieval/lecture/nE8nq/lesson-5-6-link-analysis-part-1

anchor text
hub and authority
PageRank

Lesson 5.7: Link Analysis - Part 2 17 min

https://www.coursera.org/learn/text-retrieval/lecture/GUQ1Q/lesson-5-7-link-analysis-part-2

The PageRank Algorithm: : Capturing Page “Popularity”
PageRank: Example

Lesson 5.8: Link Analysis - Part 3 5 min

https://www.coursera.org/learn/text-retrieval/lecture/d6INf/lesson-5-8-link-analysis-part-3
The key idea of HITS (Hypertext-Induced Topic Search)
– Good authorities are cited by good hubs
– Good hubs point to good authorities
– Iterative reinforcement…
The HITS Algorithm: Capturing Authorities & Hubs

Week 5 Activities

Practice Quiz: Week 5 Practice Quiz 10 questions

PRACTICE QUIZ
Week 5 Practice Quiz
10 questions
To Pass70% or higher
Deadline
July 9, 11:59 PM PDT
You can still pass this quiz before the course ends.

1 point
1.In PageRank, what is NOT the benefit of introducing random jumping?

Otherwise for zero-outlink nodes will receive all the probability

Otherwise PageRank will favor nodes with fewer incoming links

Otherwise disconnected page always has zero probability

1 point
2.Modern web search engines often combine many features (e.g., content-based scores, link-based scores) to rank documents.

True

False

1 point
3.HITS and PageRank only use the inter-document links when calculating a document’s score, without considering the content of the document.

True

False

1 point
4.Can a crawler that only follows hyperlinks identify hidden pages that do not have any incoming links?

Yes

No

1 point
5.Which one is NOT a feature of incremental crawling?

Target at a subset of pages (e.g., all pages about “automobiles’’)

Target at frequently accessed pages

Target at frequently updated pages

Can learn from past experience (updated daily vs. monthly)

1 point
6.Which of the following sites will get a larger value in PageRank?

A site that has many others linking to it but it does not link to others

A site that outlinks to many others but none others link to it

1 point
7.Which of the following is NOT a reason that PageRank is efficient?

Dimension of matrix M is usually large

Matrix M is usually sparse

Each iteration of PageRank is simple (matrix multiplication)

1 point
8.Which algorithm has more parameters than the other for running on the same dataset?

PageRank

HITS

1 point
9.If we analyze a large network in Facebook, which algorithms will find both zoombie users and celebrity users?

PageRank

HITS

1 point
10.Which of following feedback does not involve humans?

Pseudo

Implicit

Relevance

Quiz: Week 5 Quiz 10 questions

QUIZ
Week 5 Quiz
10 questions
To Pass70% or higher
Attempts3 every 8 hours
Deadline
July 9, 11:59 PM PDT
You can still pass this quiz before the course ends.

1 point
1.Which of the following is not true about GFS?

The GFS keeps multiple replicas of the same file chunk.

The file data transfer happens directly between the GFS client and the GFS master.

The file data transfer happens directly between the GFS client and the GFS chunkservers.

1 point
2.MapReduce allows parallelizing the creation of the inverted index.

True

False

1 point
3.In MapReduce, the Reduce function is called for each unique key of the output key-value pairs from the Map function.

False

True

1 point
4.Which of the following would cause a web page P to have a higher PageRank score?

Add to another page Q a link that points to page P

Add to page P a link that points to another page Q

1 point
5.Imagine if the web is fully connected with N pages such that for any pair of pages, P and Q, there exists a link from P to Q, then which of the following is true?

At least one page will have a PageRank score larger than 1/N.

All the pages will have a PageRank score of 1/N.

At least one page will have a PageRank score smaller than 1/N.

1 point
6.Which of the following is/are the difference(s) between pseudo and implicit feedback?

Both methods do not involve user activity.

Implicit feedback assumes user-clicked documents are relevant.

Pseudo feedback assumes top ranked documents are relevant.

1 point
7.What is true about Rocchio feedback?

Negative examples are as important as positive examples.

It works the best if you discard original query weights during feedback.

It can be used for relevance feedback and also pseudo feedback.

All words are used (no truncation) so that performance and efficiency can be guaranteed.

1 point
8.The advantages of incremental crawling are:

It targets frequently accessed pages.

It learns from past experience.

It adds overhead.

It targets frequently updated pages.

1 point
9.What is NOT the reason that PageRank works better than “citation counting”?

It utilizes text information other than link information.

PageRank has smoothing.

PageRank considers “indirect citations.”

1 point
10.What is WRONG about HITS algorithm?

The adjacency matrix is symmetric.

It has two sets of parameters: a (authority) and h (hub).

It’s an iterative algorithm.

Week 6

In this week’s lessons, you will learn how machine learning can be used to combine multiple scoring factors to optimize ranking of documents in web search (i.e., learning to rank), and learn techniques used in recommender systems (also called filtering systems), including content-based recommendation/filtering and collaborative filtering. You will also have a chance to review the entire course.

Week 6 Information

Week 6 Overview 10 min

Time

This module should take approximately 3-6 hours of dedicated time to complete, with its videos and assignments.

Activities

The activities for this module are listed below (with assignments in bold):

Activity	Estimated Time Required
Week 6 Video Lectures	2 hours
Week 6 Graded Quiz	1 hour
Optional Programming Assignment 2	3 hours

Goals and Objectives

After you actively engage in the learning experiences in this module, you should be able to:

Explain the basic idea of using machine learning to combine multiple features for ranking documents (i.e., learning to rank).
Explain how we can extend a retrieval system to perform content-based information filtering (recommendation).
Explain how we can use a linear utility function to evaluate an information filtering system.
Explain the basic idea of collaborative filtering.
Explain how the memory-based collaborative filtering algorithm works.

Guiding Questions

Develop your answers to the following guiding questions while completing the readings and working on assignments throughout the week.

What’s the basic idea of learning to rank?

General idea:
– Given a query-doc pair (Q,D), define various kinds of features Xi(Q,D)
– Examples of feature: the number of overlapping terms, BM25 score of Q and D, p(Q|D), PageRank of D, p(Q|Di), where Di may be anchor text or big font text, “does the URL contain ‘~’?”….
– Hypothesize p(R=1|Q,D)=s(X1(Q,D),…,Xn(Q,D), $\lambda$) where $\lambda$ is a
set of parameters
– Learn $\lambda$ by fitting function s with training data, i.e., 3-tuples like (D, Q, 1) (D is relevant to Q) or (D,Q,0) (D is non-relevant to Q)

How can logistic regression be used to combine multiple features for improving ranking accuracy of a search engine?

Logistic Regression: Xi(Q,D) is feature; $\beta$’s are parameters; Estimate $\beta$’s by maximizing the likelihood of training data; Once $\beta$’s are known, we can take Xi(Q,D) computed based on a new query and a new document to generate a score for D w.r.t. Q.

What is content-based information filtering?

Look at what items U likes, and then check if X is similar. Item similarity => content-based filtering

How can we use a linear utility function to evaluate a filtering system? How should we set the coefficients in such a linear utility function?

Linear Utility = 3 #good - 2 #bad
#good (#bad): number of good (bad) documents delivered to user
It depends.

How can we extend a retrieval system to perform content-based information filtering?
• “Reuse” retrieval techniques to score documents
• Use a score threshold for filtering decision
• Learn to improve scoring with traditional feedback
• New approaches to threshold setting and learning
• Utility evaluation function to evaluate
What is the exploration-exploitation tradeoff?

The dilemma is between choosing what you know and getting something close to what you expect (‘exploitation’) and choosing something you aren’t sure about and possibly learning more (‘exploration’).

How does the beta-gamma threshold learning algorithm work?

Encourage exploration up to $\theta_{zero}$: $$\theta = \alpha \cdot \theta_{zero} + (1 - \alpha) \cdot \theta_{optimal}$$ The more examples,the less exploration(closer to optimal): $$\alpha = \beta + (1-\beta) \cdot e^{-N \cdot \gamma} , \beta, \gamma \in [0,1]$$ N =#training examples

What is the basic idea of collaborative filtering?

General idea
– Given a user u, find similar users {u1, …, um}
– Predict u’s preferences based on the preferences of u1, …, um
– User similarity can be judged based on their similarity in preferences on a common set of items

How does the memory-based collaborative filtering algorithm work?

General ideas:
– Xij: rating of object oj by user ui
– ni: average rating of all objects by user ui
–Normalized ratings: Vij = Xij – ni
– Prediction of rating of object oj by user ua $$\hat{v_{aj}} = k \sum_{i=1}^{m}{w(a,i)v_{ij}}, \hat{x_{aj}} = \hat{v_{aj}}+n_a, k = 1/\sum_{i=1}^{m}{w(a,i)}$$
– Specific approaches differ in w(a,i) – the distance/similarity between user ua and ui

What is the “cold start” problem in collaborative filtering?

little information about users at the beginning

Additional Readings and Resources

Key Phrases and Concepts

Keep your eyes open for the following key terms or phrases as you complete the readings and interact with the lectures. These topics will help you better understand the content in this module.

Learning to rank, features, and logistic regression
Content-based filtering
Collaborative filtering
Beta-gamma threshold learning
Linear utility
User profile
Exploration-exploitation tradeoff
Memory-based collaborative filtering
Cold start

Week 6 Lessons

Lesson 6.1: Learning to Rank - Part 1 5 min

https://www.coursera.org/learn/text-retrieval/lecture/mFYTD/lesson-6-1-learning-to-rank-part-1
General idea of Learning to Rank

Lesson 6.2: Learning to Rank - Part 2 10 min

https://www.coursera.org/learn/text-retrieval/lecture/3d9fD/lesson-6-2-learning-to-rank-part-2
Regression-Based Approaches
Logistic Regression: Xi(Q,D) is feature; $\beta$’s are parameters
Estimate $\beta$’s by maximizing the likelihood of training data
Once $\beta$’s are known, we can take Xi(Q,D) computed based on a new query and a new document to generate a score for D w.r.t. Q

Lesson 6.3: Learning to Rank - Part 3 4 min

https://www.coursera.org/learn/text-retrieval/lecture/h3Jru/lesson-6-3-learning-to-rank-part-3
More Advanced Learning Algorithms

Lesson 6.4: Future of Web Search 13 min

https://www.coursera.org/learn/text-retrieval/lecture/kM78U/lesson-6-4-future-of-web-search
Next Generation Search Engines
The Data-User-Service (DUS) Triangle
Millions of Ways to Connect the DUS Triangle
Future Intelligent Information Systems

Lesson 6.5: Recommender Systems: Content-Based Filtering - Part 1 12 min

https://www.coursera.org/learn/text-retrieval/lecture/QORNe/lesson-6-5-recommender-systems-content-based-filtering-part-1
Two Modes of Text Access: Pull vs. Push
Recommender - Filtering System
Basic Filtering Question: Will User U Like Item X?
A Typical Content-Based Filtering System
Three Basic Problems in Content-Based Filtering
Extend a Retrieval System for Information Filtering
A General Vector-Space Approach

Lesson 6.6: Recommender Systems: Content-Based Filtering - Part 2 10 min

https://www.coursera.org/learn/text-retrieval/lecture/7M0GD/lesson-6-6-recommender-systems-content-based-filtering-part-2
Difficulties in Threshold Learning
Empirical Utility Optimization
Beta-Gamma Threshold Learning

Lesson 6.7: Recommender Systems: Collaborative Filtering - Part 1 6 min

https://www.coursera.org/learn/text-retrieval/lecture/cIFsU/lesson-6-7-recommender-systems-collaborative-filtering-part-1
What is Collaborative Filtering (CF)?
CF: Assumptions
The Collaboration Filtering Problem

Lesson 6.8: Recommender Systems: Collaborative Filtering - Part 2 12 min

https://www.coursera.org/learn/text-retrieval/lecture/awVwS/lesson-6-8-recommender-systems-collaborative-filtering-part-2
Memory-based Approaches
User Similarity Measures
Improving User Similarity Measures

Lesson 6.9: Recommender Systems: Collaborative Filtering - Part 3 4 min

https://www.coursera.org/learn/text-retrieval/lecture/tfXZ4/lesson-6-9-recommender-systems-collaborative-filtering-part-3
Filtering/Recommendation is “easy”
Filtering is “hard”
Content-based vs. Collaborative filtering vs. Hybrid
Recommendation can be combined with search :Push + Pull
Many advanced algorithms have been proposed to use more context information and advanced machine learning

Lesson 6.10: Course Summary 9 min

https://www.coursera.org/learn/text-retrieval/lecture/9CAed/lesson-6-10-course-summary

Week 6 Activities

Practice Quiz: Week 6 Practice Quiz 10 questions

PRACTICE QUIZ
Week 6 Practice Quiz
10 questions
To Pass70% or higher
Deadline
July 16, 11:59 PM PDT

1 point
1.In collaborative filtering, to measure user similarity using cosine, which one of the following is correct?

It solves problem of missing values.

It will not be biased by the user activity (less/more activity of users).

1 point
2.Which of the following tasks can be solved as a classification problem?

Recommendation

Ranking

Spamming filtering

1 point
3.In recommendation/filtering, which heuristic in the following is NOT correct?

If a user likes one item, then he/she will dislike other items not similar to the one liked.

Similar users will like the same items.

The same user will like similar items.

1 point
4.Comparing the logistic regression learning-to-rank method vs. the query likelihood model, which of the following is NOT true?

The query likelihood method does not require time for training.

The logistic regression method is better as it can take the query likelihood output as features.

If trained until converged, the logistic regression will output the same solution (probability) as the query likelihood model as they are using the same relevance judgement.

1 point
5.GFS is a parallel programming framework that allows parallelized construction of the inverted index.

False

True

1 point
6.After obtaining the chunk’s handle and locations from the GFS master, the GFS client (application) obtains the actual file data directly from one of the GFS chunkservers.

False

True

1 point
7.What is the advantage of Learning to Rank over BM25?

BM25 is much slower to train than Learning to Rank.

Learning to Rank can combine many more features than BM25.
1 point
8.Can the regression based approach Learning to Rank utilize multi-grade relevance judgement (for example, not very relevant, not relevant, mediocre, relevant, very relevant)?

Yes

1 point
9.Which is/are the reason(s) learning-based algorithms became popular in text retrieval?

There are more information needs such as removing spamming or diversified ranking.

More features are available now for determining the results.

More methods are available now for combining.

1 point
10.When a new user comes, which of the following will NOT help for recommendation?

Ask user to first select a few items that he likes

Ask user to provide a short description of himself

Recommend user with random selected items

5.Explanation: GFS is a distributed file system that aims at providing scalability and reliability in file storage.
6.Explanation: The actual data transfer happens directly between the client and the chunkservers. The master can only send and receive control signals (not data files).

Quiz: Week 6 Quiz 10 questions

QUIZ
Week 6 Quiz
10 questions
To Pass70% or higher
Attempts3 every 8 hours
Deadline
July 16, 11:59 PM PDT

1 point
1.When using the learning to rank framework for combining multiple features into a ranking function, training data composed of queries and relevance judgments is needed to learn the model parameters.

True

False

1 point
2.Information filtering systems are more suitable to help users satisfy long-term information needs than short-term ad hoc information needs.

False

True

1 point
3.In content-based filtering, an item is recommended to a user based on whether other “similar” users like the item or not.

True

False

1 point
4.In recommendation systems, one uses Beta-Gamma threshold learning for trade-off between exploration and exploitation: $θ=α*θ_{zero}+(1−α)*θ_{optimal}$. Which of the following is true?

α should be smaller for new users

α should the same for all users no matter if they are new

α should be larger for new users

1 point
5.Content-based filtering and collaborative filtering can be combined in a recommender system.

True

False

1 point
6.Which of the following is not an advantage of Learning to Rank?

Directly optimize retrieval measures

Is much faster than BM25 or language model when training

Combining multiple sources of features

1 point
7.Recommendation is one type of Pull mode of information access.

True

False

1 point
8.In Netflix, if a user has watched a lot of thriller movies, then it recommends “Inception” and “The Silence of the Lambs” to the user. What is this an example of?

This is content-based filtering.

This is collaborative filtering.

1 point
9.In Spotify, if a user has indicated himself/herself as youth, then Spotify recommends songs that are most listened by users under 20 years old. What is this an example of?

This is collaborative filtering.

This is content-based filtering.

1 point
10.When adding social network information into recommendation systems, such as friends’ info and friends’ liked items, this can be used to help:

Collaborative filtering

Content-based filtering

Optional Honors Content Honors Track Programming Assignment

Programming Assignment: Programming Assignment 23h

Text Mining and Analytics

Course can be found here
My certificate can be found here
Lecture slides can be found here

Pattern Discovery in Data Mining

Course can be found here
My certificate can be found here
Lecture slides can be found here

Cluster Analysis in Data Mining

Course can be found here
My certificate can be found here
Lecture slides can be found here

Data Mining Project

Course can be found here
My certificate can be found here
Lecture slides can be found here

Data Visualization

Week 1 Course Orientation

About the Course

Welcome to Data Visualization

Overview

Time

Activities

Goals and Objectives

Syllabus

Course Description

Course Goals and Objectives

Textbook and Readings

Course Outline

Elements of This Course

Information About Lectures

Discussion Forums

How to Pass the Course

Getting and Giving Help

About the Discussion Forums

Expectations

Helpful Tools

Improving Your Posts

Updating Your Profile

Social Media

More Information

Resources

Orientation Activities

Week 1 Information

Week 1 Overview

Overview

Time

Activities

Goals and Objectives

Guiding Questions

Key Phrases and Concepts

Tips for Success

Getting and Giving Help

Week 1 Introduction

How the Programming Assignments Work

How the Programming Assignments Work

Instructions for Week 1

Lesson 1-1: Introduction

1.1.1. Some Books on Data Visualization

1.1.2. Overview of Visualization

Lesson 1-2: Graphics, Drawing and Photorealism

1.2.1. 2-D Graphics

SVG-example

1.2.2. 2-D Drawing

1.2.3. 3-D Graphics

1.2.4. Photorealism

1.2.5. Non-Photorealism

Lesson 1-3: Humans and Visualizations

1.3.1. The Human

1.3.2. Memory

1.3.3. Reasoning

1.3.4. The Human Retina

1.3.5. Perceiving Two Dimensions

1.3.6. Perceiving Perspective

Week 1 Activities

Quiz with 10/10 points (100%)

Discussion Prompt

Overview

Time Estimate

Requirements

Submission

Week 2: Visualization of Numerical Data

Week 2 Information

Week 2 Overview 10 min

Overview

Time

Activities

Goals and Objectives

Guiding Questions

Additional Readings and Resources

Key Phrases and Concepts

Tips for Success

Getting and Giving Help

Week 2 Introduction 2 min

Lesson 2-1: Data, Mapping, and Charts

2.1.1. Data 7 min